Post Graduate Program in Data Science and Business Analytics - Supervised Learning: Linear Regression on the sales data of a fictitious company, Cars4U.
From GreatLearning:
There is a huge demand for used cars in the Indian Market today. As sales of new cars have slowed down in the recent past, the pre-owned car market has continued to grow over the past years and is larger than the new car market now. Cars4U is a budding tech start-up that aims to find footholds in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales and that could mean that the demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones. Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturer / except for dealership level discounts which come into play only in the last stage of the customer journey), used cars are very different beasts with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market.
As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and can help the business in devising profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
Explore and visualize the dataset.
Build a linear regression model to predict the prices of used cars.
Generate a set of insights and recommendations that will help the business.
The business does not know the market price of every used vehicle, or how those prices will move, so it cannot currently predict prices effectively or give customers good guidance on how much to spend.
The problem is scoped to the Indian market; influences from external markets will be ignored. Background knowledge of the Indian car market is therefore useful context for this experiment.
The dataset aims to answer the following key questions:
What are the predictive variables actually affecting used car price?
What can we avoid looking at to save time?
What do these relationships tell us about how to consult the selling agent on how to price their used cars?
At first glance of the dataset, I expect the most predictive variables for price to be Year, Name, and Kilometers_Driven. These match the questions people typically ask about a car: What make and model? What year, or how old is it? How many kilometers does it have on it? Together these usually give a ball-park estimate of the car's value.
used_cars_data.csv - provided by the Post Graduate Program in Data Science and Business Analytics. Contains information about fictitious cars.
S.No. : Serial Number.
Name : Name of the car which includes Brand name and Model name.
Location : The location in which the car is being sold or is available for purchase (Cities).
Year : Manufacturing year of the car.
Kilometers_driven : Total kilometers driven in the car by the previous owner(s).
Fuel_Type : The type of fuel used by the car. (Petrol, Diesel, Electric, CNG, LPG).
Transmission : The type of transmission used by the car. (Automatic / Manual).
Owner : Type of ownership.
Mileage : The standard mileage quoted by the manufacturer, in kmpl or km/kg, i.e., the distance the car can travel on one liter (or kilogram) of fuel.
Engine : The displacement volume of the engine in CC.
Power : The maximum power of the engine in bhp. From GreatLearning Expert:"BHP or Brake Horse Power is the unit of power of an engine which determines the acceleration and top speed of a vehicle. Higher the BHP higher the power of an engine."
Seats : The number of seats in the car.
New_Price : The price of a new car of the same model, in INR Lakhs (1 Lakh = 100,000 INR). If this is lower than the used price, the used car is not worth it. Note: the ($) sign is used here to denote monetary units generally, not USD.
Price : The price of the used car in INR Lakhs (1 Lakh = 100,000 INR).
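Since New_Price and Price arrive as strings like "8.61 Lakh", a minimal parsing sketch may be useful. This is an assumption about the format, with a hypothetical helper `to_lakh`; "Cr" (Crore = 100 Lakh) is handled defensively in case such values appear in the data.

```python
import numpy as np
import pandas as pd

# Hypothetical samples mirroring the New_Price string format
new_price = pd.Series(["8.61 Lakh", "95.13 Lakh", np.nan, "1.2 Cr"])


def to_lakh(x):
    """Parse a '<number> <unit>' price string into a float in Lakhs.
    'Cr' (Crore) equals 100 Lakh; non-strings stay missing."""
    if not isinstance(x, str):
        return np.nan
    value, unit = x.split()
    return float(value) * (100 if unit == "Cr" else 1)


print(new_price.map(to_lakh).tolist())  # [8.61, 95.13, nan, 120.0]
```

Keeping everything in Lakhs preserves a single consistent monetary unit for modeling.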
4. Model building - Linear Regression
5. Statistics
7. Inferences
# Install needed libraries
#!pip3 install pandas_profiling
!pip install scipy==1.6.1
!pip install jupyter_nbextensions_configurator
%load_ext nb_black
# Warning packages
import warnings
warnings.filterwarnings("ignore")
# import Python packages
import itertools
from itertools import cycle, chain
# Data analysis packages
import pylab
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
# Data visualization packages
import seaborn as sns
import matplotlib.pyplot as plt
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
plt.rcParams["figure.figsize"] = [12, 8]  # set default figure size to 12" x 8"
# for statistical packages
from sklearn.preprocessing import RobustScaler
from sklearn import linear_model # for working with polynomial processing
from sklearn.linear_model import LinearRegression # To build linear regression_model
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.stats.api as sms
from statsmodels.compat import lzip
import scipy.stats as stats
from scipy.stats import mode, norm, pearsonr, tmean  # for the Pearson correlation matrix and summary stats
# Sklearn package's randomized data splitting function
from sklearn.model_selection import train_test_split
# To check model performance
from sklearn.metrics import (
mean_absolute_error,
mean_squared_error,
r2_score,
)
from sklearn.preprocessing import PolynomialFeatures # for adding features to the model
from sklearn.preprocessing import (
StandardScaler,
) # for standard scaling of normalized data (assumed for Linear Regression model)
"""perform linear regression using *statsmodels*,
a Python module that provides functions for the estimation of many statistical models,
as well as for conducting statistical tests, and statistical data exploration.
"""
# widen the display limits for columns and rows
# so the entire dataframe is visible when printed
pd.set_option("display.max_columns", None)
pd.set_option("display.max_rows", 200)
# check scipy version for 1.6.1 for using attributes of testing functions
import scipy
scipy.__version__
'1.6.1'
S.No. gives us no additional information about the used cars at this level of analysis (trends and prediction); it acts purely as an index, which we already have, so I will set it as the index in read_csv.
# import dataset and make a dataframe
df = pd.read_csv("used_cars_data.csv", index_col="S.No.")
pd.set_option(
"display.float_format", lambda x: "%.2f" % x
) # To supress numerical display in scientific notations
# Keep an untouched copy of the raw data in case we need to revert
df_original = df.copy()
# View shape of the dataframe
print(f"The dataframe shape is {df.shape}")
# I'm now going to look at 10 random rows
# I'm setting the random seed via np.random.seed so that
# I get the same random results every time
np.random.seed(1)
df.sample(n=10)
The dataframe shape is (7253, 13)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 2397 | Ford EcoSport 1.5 Petrol Trend | Kolkata | 2016 | 21460 | Petrol | Manual | First | 17.0 kmpl | 1497 CC | 121.36 bhp | 5.00 | 9.47 Lakh | 6.00 |
| 3777 | Maruti Wagon R VXI 1.2 | Kochi | 2015 | 49818 | Petrol | Manual | First | 21.5 kmpl | 1197 CC | 81.80 bhp | 5.00 | 5.44 Lakh | 4.11 |
| 4425 | Ford Endeavour 4x2 XLT | Hyderabad | 2007 | 130000 | Diesel | Manual | First | 13.1 kmpl | 2499 CC | 141 bhp | 7.00 | NaN | 6.00 |
| 3661 | Mercedes-Benz E-Class E250 CDI Avantgrade | Coimbatore | 2016 | 39753 | Diesel | Automatic | First | 13.0 kmpl | 2143 CC | 201.1 bhp | 5.00 | NaN | 35.28 |
| 4514 | Hyundai Xcent 1.2 Kappa AT SX Option | Kochi | 2016 | 45560 | Petrol | Automatic | First | 16.9 kmpl | 1197 CC | 82 bhp | 5.00 | NaN | 6.34 |
| 599 | Toyota Innova Crysta 2.8 ZX AT | Coimbatore | 2019 | 40674 | Diesel | Automatic | First | 11.36 kmpl | 2755 CC | 171.5 bhp | 7.00 | 28.05 Lakh | 24.82 |
| 186 | Mercedes-Benz E-Class E250 CDI Avantgrade | Bangalore | 2014 | 37382 | Diesel | Automatic | First | 13.0 kmpl | 2143 CC | 201.1 bhp | 5.00 | NaN | 32.00 |
| 305 | Audi A6 2011-2015 2.0 TDI Premium Plus | Kochi | 2014 | 61726 | Diesel | Automatic | First | 17.68 kmpl | 1968 CC | 174.33 bhp | 5.00 | NaN | 20.77 |
| 4582 | Hyundai i20 1.2 Magna | Kolkata | 2011 | 36000 | Petrol | Manual | First | 18.5 kmpl | 1197 CC | 80 bhp | 5.00 | NaN | 2.50 |
| 5434 | Honda WR-V Edge Edition i-VTEC S | Kochi | 2019 | 13913 | Petrol | Manual | First | 17.5 kmpl | 1199 CC | 88.7 bhp | 5.00 | 9.36 Lakh | 8.20 |
Observations:
The Name column combines brand and model. Mileage, Power, and New_Price mix numeric and string values that will need processing. New_Price also has many missing values to handle. Overall, there are many string-valued columns that are not usable as-is in a linear regression model.
# View top 5 rows of dataframe
df.head()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 bhp | 5.00 | NaN | 1.75 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.2 bhp | 5.00 | NaN | 12.50 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.7 bhp | 5.00 | 8.61 Lakh | 4.50 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 bhp | 7.00 | NaN | 6.00 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.8 bhp | 5.00 | NaN | 17.74 |
# View bottom 5 rows of dataframe
df.tail()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 20.54 kmpl | 1598 CC | 103.6 bhp | 5.00 | NaN | NaN |
| 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 17.21 kmpl | 1197 CC | 103.6 bhp | 5.00 | NaN | NaN |
| 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 23.08 kmpl | 1461 CC | 63.1 bhp | 5.00 | NaN | NaN |
| 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 17.2 kmpl | 1197 CC | 103.6 bhp | 5.00 | NaN | NaN |
| 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 10.0 kmpl | 2148 CC | 170 bhp | 5.00 | NaN | NaN |
Observations:
The tail shows many missing (NaN) values, notably in New_Price and Price. These will need to be handled.
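Before handling the gaps, it helps to quantify them per column. A small sketch (with a hypothetical stand-in frame, not the full dataset):

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame showing the kind of gaps seen in the tail above
sample = pd.DataFrame(
    {
        "New_Price": ["8.61 Lakh", np.nan, np.nan, "5.44 Lakh"],
        "Price": [4.50, 12.50, np.nan, 4.11],
    }
)

# Count missing values per column, worst first
missing = sample.isna().sum().sort_values(ascending=False)
print(missing)
```

Running the same `isna().sum()` on the real dataframe ranks columns by missingness and tells us where imputation or dropping is needed.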
Getting all the features into numeric form for modeling.
# get dataframe shape
df.shape
(7253, 13)
# get dtypes
df.dtypes
Name object Location object Year int64 Kilometers_Driven int64 Fuel_Type object Transmission object Owner_Type object Mileage object Engine object Power object Seats float64 New_Price object Price float64 dtype: object
# Get the information on the df
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7253 entries, 0 to 7252 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7253 non-null object 1 Location 7253 non-null object 2 Year 7253 non-null int64 3 Kilometers_Driven 7253 non-null int64 4 Fuel_Type 7253 non-null object 5 Transmission 7253 non-null object 6 Owner_Type 7253 non-null object 7 Mileage 7251 non-null object 8 Engine 7207 non-null object 9 Power 7207 non-null object 10 Seats 7200 non-null float64 11 New_Price 1006 non-null object 12 Price 6019 non-null float64 dtypes: float64(2), int64(2), object(9) memory usage: 793.3+ KB
Observations:
There are many different dtypes that will need to be checked and possibly changed. From the samples above: S.No., Kilometers_Driven, Engine, and Seats are clearly meant to be int64; Name, Location, Fuel_Type, Owner_Type, Year, and Transmission are string categoricals; and Mileage and Power are floats. New_Price and Price come with a stated conversion rate in the description, so I could convert them to another currency, but since the other units are metric I will keep the monetary values in Indian currency (Lakhs). This leaves them as dtype=float.
Insights:
The Transmission column has 2 unique values: "Manual" and "Automatic". The Engine column has about 150 unique values, i.e., the data covers roughly 150 distinct engine sizes. Year ranges from 1996 to 2019. Mean Power is about 112.31 bhp.
# Get the dataframe columns list
df.columns
Index(['Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
'Transmission', 'Owner_Type', 'Mileage', 'Engine', 'Power', 'Seats',
'New_Price', 'Price'],
dtype='object')
cat_vars = [
"Name",
"Location",
"Fuel_Type",
"Transmission",
"Owner_Type",
# "Year",
]
for colname in cat_vars:
df[colname] = df[colname].astype("category")
df.info() # Explore dataframe information, check dtype, and nulls.
<class 'pandas.core.frame.DataFrame'> Int64Index: 7253 entries, 0 to 7252 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7253 non-null category 1 Location 7253 non-null category 2 Year 7253 non-null int64 3 Kilometers_Driven 7253 non-null int64 4 Fuel_Type 7253 non-null category 5 Transmission 7253 non-null category 6 Owner_Type 7253 non-null category 7 Mileage 7251 non-null object 8 Engine 7207 non-null object 9 Power 7207 non-null object 10 Seats 7200 non-null float64 11 New_Price 1006 non-null object 12 Price 6019 non-null float64 dtypes: category(5), float64(2), int64(2), object(4) memory usage: 633.9+ KB
Observations:
Numerics
Before I can go on to do analysis, I need to make sure the numeric types are set correctly by removing the string units from them. These units of measurement are important for understanding the data, so I will preserve them alongside the values to keep them top of mind when writing insights and recommendations.
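The value/unit separation done column by column below could also be sketched in one vectorized pass with `str.extract` and named groups; this is an alternative approach under my own assumptions, not the course's prescribed method:

```python
import pandas as pd

# Hypothetical mixed-unit column shaped like Mileage
mileage = pd.Series(["26.6 km/kg", "19.67 kmpl", None])

# One regex pass: named groups put the number and the unit in separate columns
parts = mileage.str.extract(r"(?P<value>[\d.]+)\s*(?P<unit>\S+)")
parts["value"] = pd.to_numeric(parts["value"])
print(parts)
```

Missing entries propagate as NaN in both new columns, so no special-casing is needed.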
# looking at value counts for non-categorical features
num_to_display = 10 # defining this up here so it's easy to change later if I want
for colname in df.dtypes[df.dtypes == "object"].index:  # iterate over the object-dtype columns
val_counts = df[colname].value_counts(dropna=False) # i want to see NA counts
print(colname, "\n", "\n", val_counts[:num_to_display])
if len(val_counts) > num_to_display:
print(f"Only displaying first {num_to_display} of {len(val_counts)} values.")
print("\n\n") # just for more space
Mileage 17.0 kmpl 207 18.9 kmpl 201 18.6 kmpl 144 21.1 kmpl 106 20.36 kmpl 105 17.8 kmpl 98 18.0 kmpl 89 12.8 kmpl 87 18.5 kmpl 86 16.0 kmpl 85 Name: Mileage, dtype: int64 Only displaying first 10 of 451 values. Engine 1197 CC 732 1248 CC 610 1498 CC 370 998 CC 309 1198 CC 281 2179 CC 278 1497 CC 273 1968 CC 266 1995 CC 212 1461 CC 188 Name: Engine, dtype: int64 Only displaying first 10 of 151 values. Power 74 bhp 280 98.6 bhp 166 73.9 bhp 152 140 bhp 142 null bhp 129 78.9 bhp 128 67.1 bhp 126 67.04 bhp 125 82 bhp 124 88.5 bhp 120 Name: Power, dtype: int64 Only displaying first 10 of 387 values. New_Price NaN 6247 63.71 Lakh 6 33.36 Lakh 6 95.13 Lakh 6 4.78 Lakh 6 11.67 Lakh 5 20.74 Lakh 5 11.75 Lakh 5 4.98 Lakh 5 15.05 Lakh 5 Name: New_Price, dtype: int64 Only displaying first 10 of 626 values.
Observations:
Looking at the non-numeric features, we can see many numeric values buried in strings; the units of measurement need to be stripped out and the remainder converted to numeric types.
Power has a value "null" which needs to be treated.
Power
Power is in bhp. We need to split the values into a numeric part and a string part, giving two columns.
# convert the "null bhp" placeholder strings in Power to actual NaN values
df["Power"] = df["Power"].replace("null bhp", np.nan)
"null bhp" in df.Power
False
def floats(n):
    """Strip the trailing ' bhp' unit from a Power string and convert it to a float.
    Non-string entries (already-missing values) are returned as np.nan."""
    if isinstance(n, str):
        return float(n.replace(" bhp", ""))
    else:
        return np.nan
col_transforms = {"Power": floats}
# k is the key (the column name); v is the value
# (the transform function to apply to that column, here `floats`)
for k, v in col_transforms.items():
df[k] = df[k].map(v)
# Look at the head
df.head()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 | 5.00 | NaN | 1.75 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.20 | 5.00 | NaN | 12.50 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.70 | 5.00 | 8.61 Lakh | 4.50 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 | 7.00 | NaN | 6.00 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.80 | 5.00 | NaN | 17.74 |
Verify the unit matches the fuel type
# check that function worked
for i in col_transforms:
display(df[i].head())
S.No. 0 58.16 1 126.20 2 88.70 3 88.76 4 140.80 Name: Power, dtype: float64
# print dtypes
df.dtypes
Name category Location category Year int64 Kilometers_Driven int64 Fuel_Type category Transmission category Owner_Type category Mileage object Engine object Power float64 Seats float64 New_Price object Price float64 dtype: object
Engine
# Strip the ' CC' unit and create a new numeric column
df["Engine_num"] = (
df["Engine"].apply(lambda x: x.replace("CC", "").strip() if type(x) == str else x)
).astype(float)
# check head
df.head()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Engine | Power | Seats | New_Price | Price | Engine_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 998 CC | 58.16 | 5.00 | NaN | 1.75 | 998.00 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 1582 CC | 126.20 | 5.00 | NaN | 12.50 | 1582.00 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 1199 CC | 88.70 | 5.00 | 8.61 Lakh | 4.50 | 1199.00 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 1248 CC | 88.76 | 7.00 | NaN | 6.00 | 1248.00 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 1968 CC | 140.80 | 5.00 | NaN | 17.74 | 1968.00 |
We don't need the Engine column anymore since we only want the numbers.
# drop column
df.drop(columns="Engine", inplace=True, axis=1)
Engine_num and Seats are floats that should really be int64 (whole numbers rather than fractional values). First, handle the missingness in each to allow the conversion; we will use median replacement.
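As an aside (a workflow assumption of mine, not the course's method): pandas also offers a nullable integer dtype, "Int64" with a capital I, which holds missing values as <NA> directly, so imputation would not be strictly required just to get an integer dtype. A minimal sketch:

```python
import numpy as np
import pandas as pd

# Hypothetical Seats-like column with one gap
seats = pd.Series([5.0, 7.0, np.nan, 5.0])

# "Int64" (capital I) is pandas' nullable integer dtype:
# missing entries stay as <NA> instead of forcing floats or imputation
seats_int = seats.astype("Int64")
print(seats_int.dtype)         # Int64
print(seats_int.isna().sum())  # 1
```

The notebook proceeds with median imputation instead, which keeps the later modeling steps simple.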
numeric_columns = ["Engine_num"]
# for every item in the list print the mean
for i in numeric_columns:
print(i, df[i].mean())
medianFiller = lambda x: x.fillna(
x.median()
) # use the median of the column to fill in na values
df[numeric_columns] = df[numeric_columns].apply(
medianFiller, axis=0
) # apply the median filling function
for i in numeric_columns:
print("After filling in medians", i, df[i].mean()) # print the post-function mean
Engine_num 1616.5734702372692 After filling in medians Engine_num 1615.7897421756516
The mean has shifted by less than one unit, so median imputation is acceptable here.
# check for na values
df["Engine_num"].isna().sum()
0
# convert Engine_num to int64
df["Engine_num"] = df["Engine_num"].astype("int64", copy=False)
# check dtypes
df.dtypes
Name category Location category Year int64 Kilometers_Driven int64 Fuel_Type category Transmission category Owner_Type category Mileage object Power float64 Seats float64 New_Price object Price float64 Engine_num int64 dtype: object
Engine_num looks correct now. Next we need to look at Seats.
Seats
# Count missing values in Seats
df.Seats.isna().sum()
53
After checking the missing count, let's impute these values with the column median.
numeric_columns = ["Seats"]
# for every item in the list print the mean
for i in numeric_columns:
print(i, df[i].mean())
medianFiller = lambda x: x.fillna(
x.median()
) # use the median of the column to fill in na values
df[numeric_columns] = df[numeric_columns].apply(
medianFiller, axis=0
) # apply the median filling function
for i in numeric_columns:
print("After filling in medians", i, df[i].mean()) # print the post-function mean
Seats 5.279722222222222 After filling in medians Seats 5.277678202123259
The mean barely changed, so we will keep the median imputation.
# convert Seats to int64, then confirm the dtype
df.Seats = df.Seats.astype("int64", copy=False)
df.Seats.dtype
dtype('int64')
That has been fixed. Now we need to continue with the parsing of object columns.
Mileage
For Mileage we need to split the numeric values into one column and the units into another. kmpl (kilometers per litre) is used for petrol and diesel cars;
km/kg (kilometers per kg) is used for CNG and LPG engines.
We can check this in our data by comparing against the Fuel_Type variable.
# strip the unit suffixes from Mileage and keep the numeric part
df["Mileage_num"] = (
df["Mileage"].apply(
lambda x: x.replace(" kmpl", "").replace(" km/kg", "").strip()
if type(x) == str
else x
)
).astype(float)
# check the df
df.head()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 58.16 | 5 | NaN | 1.75 | 998 | 26.60 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 126.20 | 5 | NaN | 12.50 | 1582 | 19.67 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 88.70 | 5 | 8.61 Lakh | 4.50 | 1199 | 18.20 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 88.76 | 7 | NaN | 6.00 | 1248 | 20.77 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 140.80 | 5 | NaN | 17.74 | 1968 | 15.20 |
# extract text
df["Mileage_unit"] = (
df["Mileage"].str.extract("([a-zA-Z ]+)", expand=False).str.strip()
).astype("category")
df.head()
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Mileage | Power | Seats | New_Price | Price | Engine_num | Mileage_num | Mileage_unit | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 26.6 km/kg | 58.16 | 5 | NaN | 1.75 | 998 | 26.60 | km |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 19.67 kmpl | 126.20 | 5 | NaN | 12.50 | 1582 | 19.67 | kmpl |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 18.2 kmpl | 88.70 | 5 | 8.61 Lakh | 4.50 | 1199 | 18.20 | kmpl |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 20.77 kmpl | 88.76 | 7 | NaN | 6.00 | 1248 | 20.77 | kmpl |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 15.2 kmpl | 140.80 | 5 | NaN | 17.74 | 1968 | 15.20 | kmpl |
# Let us check if the units correspond to the fuel types as expected.
df.groupby(by=["Fuel_Type", "Mileage_unit"]).size()
Fuel_Type Mileage_unit
CNG km 62
kmpl 0
Diesel km 0
kmpl 3852
Electric km 0
kmpl 0
LPG km 12
kmpl 0
Petrol km 0
kmpl 3325
dtype: int64
As expected, km/kg (extracted as just "km", since the regex does not capture the slash) is used for CNG/LPG cars, and kmpl for Petrol and Diesel cars.
Now that Mileage is separated we can drop the original column and the unit column knowing how to pair them to Fuel_Type.
# drop column
df.drop(labels=["Mileage", "Mileage_unit"], inplace=True, axis=1)
# check columns
df.columns
Index(['Name', 'Location', 'Year', 'Kilometers_Driven', 'Fuel_Type',
'Transmission', 'Owner_Type', 'Power', 'Seats', 'New_Price', 'Price',
'Engine_num', 'Mileage_num'],
dtype='object')
# check conversion
df.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 7253 entries, 0 to 7252 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Name 7253 non-null category 1 Location 7253 non-null category 2 Year 7253 non-null int64 3 Kilometers_Driven 7253 non-null int64 4 Fuel_Type 7253 non-null category 5 Transmission 7253 non-null category 6 Owner_Type 7253 non-null category 7 Power 7078 non-null float64 8 Seats 7253 non-null int64 9 New_Price 1006 non-null object 10 Price 6019 non-null float64 11 Engine_num 7253 non-null int64 12 Mileage_num 7251 non-null float64 dtypes: category(5), float64(3), int64(4), object(1) memory usage: 891.9+ KB
Remove unneeded fuel types: "Electric" is a nascent market with only two rows, and would sway the pricing.
df.Fuel_Type.value_counts()
Diesel 3852 Petrol 3325 CNG 62 LPG 12 Electric 2 Name: Fuel_Type, dtype: int64
# subset columns
elec = df.loc[df.Fuel_Type == "Electric"]
display(elec)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 4446 | Mahindra E Verito D4 | Chennai | 2016 | 50000 | Electric | Automatic | First | 41.00 | 5 | 13.58 Lakh | 13.00 | 72 | NaN |
| 4904 | Toyota Prius 2009-2016 Z4 | Mumbai | 2011 | 44000 | Electric | Automatic | First | 73.00 | 5 | NaN | 12.75 | 1798 | NaN |
# drop rows
df.drop(labels=[4446, 4904], axis=0, inplace=True)
# check drop
df.loc[df.Fuel_Type == "Electric"]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. |
# check the drop
df.Fuel_Type.value_counts().to_frame()
| Fuel_Type | |
|---|---|
| Diesel | 3852 |
| Petrol | 3325 |
| CNG | 62 |
| LPG | 12 |
| Electric | 0 |
# update the Categorical dtype of Fuel_Type (https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html)
# Values which are removed are replaced by np.nan.
display(df.Fuel_Type.cat.categories)
df.Fuel_Type = df.Fuel_Type.cat.remove_categories(["Electric"])
display(df.Fuel_Type.cat.categories)
Index(['CNG', 'Diesel', 'Electric', 'LPG', 'Petrol'], dtype='object')
Index(['CNG', 'Diesel', 'LPG', 'Petrol'], dtype='object')
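As an aside, pandas also offers cat.remove_unused_categories(), which prunes every category with zero remaining rows instead of naming them explicitly. A minimal sketch on toy data:

```python
import pandas as pd

s = pd.Series(["Diesel", "Petrol", "Electric"], dtype="category")
s = s[s != "Electric"]                 # drop the rows; the category remains registered
print(list(s.cat.categories))          # still includes 'Electric'
s = s.cat.remove_unused_categories()   # prune any category with no rows left
print(list(s.cat.categories))
```

This is handy after row filtering, since it requires no hard-coded category list.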
Mileage may have odd values in it. Let's check.
Extreme values in Mileage
# check the minimum
df.Mileage_num.min()
0.0
# get subset
mil = df.loc[df.Mileage_num == df.Mileage_num.min()]
display(mil)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 14 | Land Rover Freelander 2 TD4 SE | Pune | 2012 | 85000 | Diesel | Automatic | Second | 115.00 | 5 | NaN | 17.50 | 2179 | 0.00 |
| 67 | Mercedes-Benz C-Class Progressive C 220d | Coimbatore | 2019 | 15369 | Diesel | Automatic | First | 194.00 | 5 | 49.14 Lakh | 35.67 | 1950 | 0.00 |
| 79 | Hyundai Santro Xing XL | Hyderabad | 2005 | 87591 | Petrol | Manual | First | NaN | 5 | NaN | 1.30 | 1086 | 0.00 |
| 194 | Honda City 1.5 GXI | Ahmedabad | 2007 | 60006 | Petrol | Manual | First | NaN | 5 | NaN | 2.95 | 1493 | 0.00 |
| 229 | Ford Figo Diesel | Bangalore | 2015 | 70436 | Diesel | Manual | First | 99.00 | 5 | NaN | 3.60 | 1498 | 0.00 |
| 262 | Hyundai Santro Xing XL | Hyderabad | 2006 | 99000 | Petrol | Manual | First | NaN | 5 | NaN | 1.75 | 1086 | 0.00 |
| 307 | Hyundai Santro Xing XL | Chennai | 2006 | 58000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.50 | 1086 | 0.00 |
| 424 | Volkswagen Jetta 2007-2011 1.9 L TDI | Hyderabad | 2010 | 42021 | Diesel | Manual | First | NaN | 5 | NaN | 5.45 | 1968 | 0.00 |
| 443 | Hyundai Santro GLS I - Euro I | Coimbatore | 2012 | 50243 | Petrol | Manual | First | NaN | 5 | NaN | 3.35 | 1086 | 0.00 |
| 544 | Mercedes-Benz New C-Class Progressive C 200 | Kochi | 2019 | 13190 | Petrol | Automatic | First | 181.43 | 5 | 49.49 Lakh | 38.99 | 1950 | 0.00 |
| 631 | Hyundai Santro LS zipPlus | Chennai | 2002 | 70000 | Petrol | Manual | Third | NaN | 5 | NaN | 0.65 | 1086 | 0.00 |
| 647 | Hyundai Santro Xing XP | Jaipur | 2004 | 200000 | Petrol | Manual | First | NaN | 5 | NaN | 0.80 | 1086 | 0.00 |
| 707 | Mercedes-Benz M-Class ML 350 4Matic | Pune | 2014 | 120000 | Diesel | Automatic | First | 165.00 | 5 | NaN | 30.00 | 2987 | 0.00 |
| 749 | Land Rover Range Rover 3.0 D | Mumbai | 2008 | 55001 | Diesel | Automatic | Second | NaN | 5 | NaN | 26.50 | 1493 | 0.00 |
| 915 | Smart Fortwo CDI AT | Pune | 2008 | 103000 | Diesel | Automatic | Second | NaN | 2 | NaN | 3.00 | 799 | 0.00 |
| 962 | Mercedes-Benz C-Class Progressive C 220d | Mumbai | 2018 | 8682 | Diesel | Automatic | First | 194.00 | 5 | 52.26 Lakh | 39.50 | 1950 | 0.00 |
| 996 | Hyundai Santro Xing GL | Pune | 2008 | 93000 | Petrol | Manual | First | 62.00 | 5 | NaN | 1.45 | 1086 | 0.00 |
| 1059 | Hyundai Santro Xing GL | Hyderabad | 2010 | 58163 | Petrol | Manual | First | 62.00 | 5 | NaN | 2.45 | 1086 | 0.00 |
| 1259 | Land Rover Freelander 2 TD4 S | Bangalore | 2010 | 125000 | Diesel | Automatic | Second | 115.00 | 5 | NaN | 11.00 | 2179 | 0.00 |
| 1271 | Hyundai Santro GLS I - Euro II | Jaipur | 2009 | 89000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.60 | 999 | 0.00 |
| 1308 | Mercedes-Benz M-Class ML 350 4Matic | Bangalore | 2014 | 33000 | Diesel | Automatic | Second | 165.00 | 5 | NaN | 43.00 | 2987 | 0.00 |
| 1345 | Maruti Baleno Vxi | Pune | 2005 | 70000 | Petrol | Manual | First | NaN | 5 | NaN | 1.30 | 1590 | 0.00 |
| 1354 | Hyundai Santro Xing GL | Kochi | 2011 | 20842 | Petrol | Manual | First | 62.00 | 5 | NaN | 2.78 | 1086 | 0.00 |
| 1385 | Honda City 1.5 GXI | Pune | 2004 | 115000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.50 | 1493 | 0.00 |
| 1419 | Hyundai Santro Xing XL | Chennai | 2007 | 82000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.35 | 1086 | 0.00 |
| 1460 | Land Rover Range Rover Sport 2005 2012 Sport | Coimbatore | 2008 | 69078 | Petrol | Manual | First | NaN | 5 | NaN | 40.88 | 1493 | 0.00 |
| 1764 | Mercedes-Benz M-Class ML 350 4Matic | Pune | 2015 | 69000 | Diesel | Automatic | First | 165.00 | 5 | NaN | 38.00 | 2987 | 0.00 |
| 1857 | Hyundai Santro DX | Hyderabad | 2007 | 96000 | Petrol | Manual | Second | NaN | 5 | NaN | 2.20 | 999 | 0.00 |
| 2053 | Mahindra Jeep MM 550 PE | Hyderabad | 2009 | 26000 | Diesel | Manual | First | NaN | 6 | NaN | 6.99 | 2498 | 0.00 |
| 2096 | Hyundai Santro LP zipPlus | Coimbatore | 2004 | 52146 | Petrol | Manual | First | NaN | 5 | NaN | 1.93 | 1493 | 0.00 |
| 2130 | Hyundai Santro GLS I - Euro II | Coimbatore | 2012 | 51019 | Petrol | Manual | First | NaN | 5 | NaN | 3.48 | 999 | 0.00 |
| 2267 | Toyota Qualis RS E2 | Pune | 2004 | 215750 | Diesel | Manual | Second | NaN | 10 | NaN | 3.50 | 2446 | 0.00 |
| 2343 | Hyundai Santro AT | Hyderabad | 2006 | 74483 | Petrol | Automatic | First | NaN | 5 | NaN | 2.30 | 999 | 0.00 |
| 2542 | Hyundai Santro GLS II - Euro II | Bangalore | 2011 | 65000 | Petrol | Manual | Second | NaN | 5 | NaN | 3.15 | 1493 | 0.00 |
| 2597 | Hyundai Santro Xing XP | Pune | 2007 | 70000 | Petrol | Manual | First | NaN | 5 | NaN | 1.12 | 1086 | 0.00 |
| 2681 | Skoda Superb 3.6 V6 FSI | Hyderabad | 2010 | 54000 | Petrol | Automatic | First | 262.60 | 5 | NaN | 6.00 | 3597 | 0.00 |
| 2780 | Hyundai Santro GLS II - Euro II | Pune | 2009 | 100000 | Petrol | Manual | First | NaN | 5 | NaN | 1.60 | 1493 | 0.00 |
| 2842 | Hyundai Santro GLS II - Euro II | Bangalore | 2012 | 43000 | Petrol | Manual | First | NaN | 5 | NaN | 3.25 | 1493 | 0.00 |
| 3033 | Hyundai Santro Xing XP | Jaipur | 2005 | 120000 | Petrol | Manual | First | NaN | 5 | NaN | 1.15 | 1086 | 0.00 |
| 3044 | Hyundai Santro Xing GL | Kolkata | 2009 | 60170 | Petrol | Manual | First | 62.00 | 5 | NaN | 1.15 | 1086 | 0.00 |
| 3061 | Hyundai Santro GS | Ahmedabad | 2005 | 58000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.51 | 999 | 0.00 |
| 3093 | Audi A7 2011-2015 Sportback | Kolkata | 2012 | 24720 | Diesel | Automatic | Second | 241.40 | 5 | NaN | 28.00 | 2967 | 0.00 |
| 3189 | Hyundai Santro GS zipDrive - Euro II | Chennai | 2002 | 67000 | Petrol | Manual | Third | NaN | 5 | NaN | 1.20 | 999 | 0.00 |
| 3210 | Mercedes-Benz M-Class ML 350 4Matic | Coimbatore | 2016 | 22769 | Diesel | Automatic | First | 165.00 | 5 | NaN | 49.22 | 2987 | 0.00 |
| 3271 | Hyundai Santro Xing GL | Bangalore | 2008 | 35268 | Petrol | Manual | Second | 62.00 | 5 | NaN | 1.88 | 1086 | 0.00 |
| 3516 | Hyundai Santro GLS I - Euro I | Pune | 2011 | 65400 | Petrol | Manual | First | NaN | 5 | NaN | 2.10 | 1086 | 0.00 |
| 3522 | Hyundai Santro GLS II - Euro II | Kochi | 2012 | 66400 | Petrol | Manual | First | NaN | 5 | NaN | 2.66 | 1493 | 0.00 |
| 3645 | Hyundai Santro Xing XP | Bangalore | 2004 | 167000 | Petrol | Manual | First | NaN | 5 | NaN | 1.35 | 1086 | 0.00 |
| 4152 | Land Rover Range Rover 3.0 D | Mumbai | 2003 | 75000 | Diesel | Automatic | Second | NaN | 5 | NaN | 16.11 | 1493 | 0.00 |
| 4234 | Mercedes-Benz M-Class ML 350 4Matic | Chennai | 2012 | 63000 | Diesel | Automatic | First | 165.00 | 5 | NaN | 26.00 | 2987 | 0.00 |
| 4302 | Hyundai Santro Xing GL | Delhi | 2012 | 61449 | Petrol | Manual | First | 62.00 | 5 | NaN | 2.15 | 1086 | 0.00 |
| 4412 | Mercedes-Benz M-Class ML 350 4Matic | Coimbatore | 2016 | 27833 | Diesel | Automatic | First | 165.00 | 5 | NaN | 49.24 | 2987 | 0.00 |
| 4629 | Fiat Siena 1.2 ELX | Jaipur | 2001 | 70000 | Petrol | Manual | Third | NaN | 5 | NaN | 0.55 | 1242 | 0.00 |
| 4687 | Land Rover Freelander 2 TD4 SE | Jaipur | 2012 | 119203 | Diesel | Automatic | First | 115.00 | 5 | NaN | 16.50 | 2179 | 0.00 |
| 4704 | Mercedes-Benz M-Class ML 350 4Matic | Bangalore | 2015 | 20000 | Diesel | Automatic | First | 165.00 | 5 | NaN | 47.35 | 2987 | 0.00 |
| 5016 | Land Rover Freelander 2 TD4 HSE | Delhi | 2013 | 72000 | Diesel | Automatic | First | 115.00 | 5 | NaN | 15.50 | 2179 | 0.00 |
| 5022 | Land Rover Freelander 2 TD4 SE | Hyderabad | 2013 | 46000 | Diesel | Automatic | Second | 115.00 | 5 | NaN | 26.00 | 2179 | 0.00 |
| 5119 | Hyundai Santro Xing XP | Kolkata | 2008 | 45500 | Petrol | Manual | Second | NaN | 5 | NaN | 1.17 | 1086 | 0.00 |
| 5270 | Honda City 1.5 GXI | Bangalore | 2002 | 53000 | Petrol | Manual | Second | NaN | 5 | NaN | 1.85 | 1493 | 0.00 |
| 5311 | Land Rover Freelander 2 TD4 SE | Hyderabad | 2012 | 139000 | Diesel | Automatic | First | 115.00 | 5 | NaN | 16.75 | 2179 | 0.00 |
| 5374 | Mercedes-Benz M-Class ML 350 4Matic | Ahmedabad | 2012 | 66000 | Diesel | Automatic | First | 165.00 | 5 | NaN | 23.00 | 2987 | 0.00 |
| 5426 | Hyundai Santro Xing XL | Chennai | 2006 | 85000 | Petrol | Manual | Third | NaN | 5 | NaN | 1.30 | 1086 | 0.00 |
| 5529 | Hyundai Santro LP - Euro II | Chennai | 2005 | 105000 | Petrol | Manual | First | NaN | 5 | NaN | 1.75 | 999 | 0.00 |
| 5647 | Toyota Qualis Fleet A3 | Mumbai | 2001 | 227000 | Diesel | Manual | Fourth & Above | NaN | 8 | NaN | 2.20 | 2446 | 0.00 |
| 5875 | Mercedes-Benz C-Class Progressive C 220d | Ahmedabad | 2019 | 4000 | Diesel | Automatic | First | 194.00 | 5 | 49.14 Lakh | 35.00 | 1950 | 0.00 |
| 5943 | Mahindra Jeep MM 540 DP | Chennai | 2002 | 75000 | Diesel | Manual | First | NaN | 6 | NaN | 1.70 | 2112 | 0.00 |
| 5972 | Hyundai Santro Xing GL | Mumbai | 2008 | 65000 | Petrol | Manual | Second | 62.00 | 5 | NaN | 1.39 | 1086 | 0.00 |
| 6011 | Skoda Superb 3.6 V6 FSI | Hyderabad | 2009 | 53000 | Petrol | Automatic | First | 262.60 | 5 | NaN | 4.75 | 3597 | 0.00 |
| 6090 | Hyundai Santro Xing GL | Ahmedabad | 2013 | 63831 | Petrol | Manual | First | 62.00 | 5 | NaN | NaN | 1086 | 0.00 |
| 6093 | Hyundai Santro Xing XL | Bangalore | 2007 | 47000 | Petrol | Manual | Second | NaN | 5 | NaN | NaN | 1086 | 0.00 |
| 6177 | Mercedes-Benz M-Class ML 350 4Matic | Bangalore | 2012 | 37000 | Diesel | Automatic | First | 165.00 | 5 | NaN | NaN | 2987 | 0.00 |
| 6205 | Hyundai Santro Xing GL | Ahmedabad | 2007 | 78000 | Petrol | Manual | First | 62.00 | 5 | NaN | NaN | 1086 | 0.00 |
| 6439 | Hyundai Santro GLS I - Euro II | Bangalore | 2011 | 43189 | Petrol | Manual | First | NaN | 5 | NaN | NaN | 999 | 0.00 |
| 6454 | Hyundai Santro LS zipDrive Euro I | Chennai | 2002 | 120000 | Petrol | Manual | Fourth & Above | NaN | 5 | NaN | NaN | 999 | 0.00 |
| 6491 | Mercedes-Benz M-Class ML 350 4Matic | Coimbatore | 2016 | 22177 | Diesel | Automatic | First | 165.00 | 5 | NaN | NaN | 2987 | 0.00 |
| 6576 | Hyundai Santro LS zipPlus | Kolkata | 2002 | 80000 | Petrol | Manual | First | NaN | 5 | NaN | NaN | 1086 | 0.00 |
| 6633 | Mahindra TUV 300 P4 | Kolkata | 2016 | 27000 | Diesel | Manual | First | NaN | 5 | NaN | NaN | 1493 | 0.00 |
| 6697 | Hyundai Santro Xing XL | Jaipur | 2007 | 85000 | Petrol | Manual | Second | NaN | 5 | NaN | NaN | 1086 | 0.00 |
| 6857 | Land Rover Freelander 2 TD4 SE | Mumbai | 2011 | 87000 | Diesel | Automatic | First | 115.00 | 5 | NaN | NaN | 2179 | 0.00 |
| 6957 | Honda Jazz 2020 Petrol | Kochi | 2019 | 11574 | Petrol | Manual | First | 88.70 | 5 | NaN | NaN | 1199 | 0.00 |
| 7226 | Hyundai Santro Xing GL | Ahmedabad | 2014 | 41000 | Petrol | Manual | First | 62.00 | 5 | NaN | NaN | 1086 | 0.00 |
There should be no 0 kmpl values; these look like data-entry or system errors.
Let's treat them as missing and handle all missing values later.
# replace with np.nan
df["Mileage_num"] = df["Mileage_num"].replace(df.Mileage_num.min(), np.nan)
# subset with loc
df.loc[df.Mileage_num == df.Mileage_num.min()]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 5781 | Lamborghini Gallardo Coupe | Delhi | 2011 | 6500 | Petrol | Automatic | Third | 560.00 | 2 | NaN | 120.00 | 5204 | 6.40 |
There are lots of ways to handle missing values. I'm going to start by investigating the patterns in the missingness.
# looking at which columns have the most missing values
df.isna().sum().sort_values(ascending=False)
New_Price 6246 Price 1234 Power 175 Mileage_num 81 Name 0 Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Seats 0 Engine_num 0 dtype: int64
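Beyond per-column counts, it can help to see which combinations of columns tend to be missing together. A minimal sketch on a toy frame standing in for df (column names borrowed from above; the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "New_Price": [np.nan, np.nan, 10.0, np.nan],
    "Price":     [np.nan, 5.0,   6.0,  np.nan],
    "Power":     [100.0,  90.0,  np.nan, 80.0],
})

# one row per distinct missingness pattern (True = missing), with its frequency
patterns = toy.isna().value_counts()
print(patterns)
```

Patterns that co-occur often (here New_Price and Price missing together) can hint at a shared cause, which matters when choosing an imputation strategy.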
Next, calculate the percentage of data points missing per variable. Variables with more than 50% missing values are generally not useful, and dropping them is recommended.
# Find percentage missing per column
df.isna().mean().sort_values(ascending=False).round(4) * 100
New_Price 86.14 Price 17.02 Power 2.41 Mileage_num 1.12 Name 0.00 Location 0.00 Year 0.00 Kilometers_Driven 0.00 Fuel_Type 0.00 Transmission 0.00 Owner_Type 0.00 Seats 0.00 Engine_num 0.00 dtype: float64
# counting the number of missing values per row
df.isnull().sum(axis=1).value_counts()
1 5212 2 1142 0 819 3 72 4 6 dtype: int64
There are 72 rows with 3 missing values, 6 with 4, and 1,142 with exactly 2. This is quite messy still and will need correcting.
# print a df of the null values
null_data = df[df.isnull().any(axis=1)]
display(null_data)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | New_Price | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 58.16 | 5 | NaN | 1.75 | 998 | 26.60 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 126.20 | 5 | NaN | 12.50 | 1582 | 19.67 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 88.76 | 7 | NaN | 6.00 | 1248 | 20.77 |
| 4 | Audi A4 New 2.0 TDI Multitronic | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 140.80 | 5 | NaN | 17.74 | 1968 | 15.20 |
| 5 | Hyundai EON LPG Era Plus Option | Hyderabad | 2012 | 75000 | LPG | Manual | First | 55.20 | 5 | NaN | 2.35 | 814 | 21.10 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | Volkswagen Vento Diesel Trendline | Hyderabad | 2011 | 89411 | Diesel | Manual | First | 103.60 | 5 | NaN | NaN | 1598 | 20.54 |
| 7249 | Volkswagen Polo GT TSI | Mumbai | 2015 | 59000 | Petrol | Automatic | First | 103.60 | 5 | NaN | NaN | 1197 | 17.21 |
| 7250 | Nissan Micra Diesel XV | Kolkata | 2012 | 28000 | Diesel | Manual | First | 63.10 | 5 | NaN | NaN | 1461 | 23.08 |
| 7251 | Volkswagen Polo GT TSI | Pune | 2013 | 52262 | Petrol | Automatic | Third | 103.60 | 5 | NaN | NaN | 1197 | 17.20 |
| 7252 | Mercedes-Benz E-Class 2009-2013 E 220 CDI Avan... | Kochi | 2014 | 72443 | Diesel | Automatic | First | 170.00 | 5 | NaN | NaN | 2148 | 10.00 |
6432 rows × 13 columns
Re-check the missing values after dropping the Electric rows.
# create a list of columns with empty, or na values
nulls = df.columns[df.isna().any().tolist()]
# looking at which columns have the most missing values
df.isna().sum().sort_values(ascending=False)
New_Price 6246 Price 1234 Power 175 Mileage_num 81 Name 0 Location 0 Year 0 Kilometers_Driven 0 Fuel_Type 0 Transmission 0 Owner_Type 0 Seats 0 Engine_num 0 dtype: int64
Observations
# Find percentage missing per column
df.isna().mean().sort_values(ascending=False).round(4) * 100
New_Price 86.14 Price 17.02 Power 2.41 Mileage_num 1.12 Name 0.00 Location 0.00 Year 0.00 Kilometers_Driven 0.00 Fuel_Type 0.00 Transmission 0.00 Owner_Type 0.00 Seats 0.00 Engine_num 0.00 dtype: float64
Observations:
The New_Price column has a lot of missing values. These could be cars that no longer have an equivalent model on the new-car market. The project description states that the new-car and used-car markets are very different, and I can't confidently say there is a strong relationship between new-car price and used-car price (even though, anecdotally, used-car sales jump when new-car prices are high), so I'll drop this column for the analysis.
Power will need to be fixed as well.
New_Price
# Drop any unneeded columns
df.drop(["New_Price"], axis=1, inplace=True)
df.isna().mean().sort_values(ascending=False).round(4) * 100
Price 17.02 Power 2.41 Mileage_num 1.12 Name 0.00 Location 0.00 Year 0.00 Kilometers_Driven 0.00 Fuel_Type 0.00 Transmission 0.00 Owner_Type 0.00 Seats 0.00 Engine_num 0.00 dtype: float64
We will replace missing values in each remaining numeric column with its median; since each (Price aside) has less than 5% missing, the distortion introduced should be negligible (the alternative is dropping the rows).
I am including Price here, and filling with the median rather than the mean so as not to skew toward outliers. Even though Price has about 17% missing, I want to keep as many data points as possible for the modeling.
# count the number missing
df.Price.isna().sum()
1234
numeric_columns = ["Power", "Price", "Mileage_num"]
for i in numeric_columns:
print(i, df[i].mean())
medianFiller = lambda x: x.fillna(x.median())
df[numeric_columns] = df[numeric_columns].apply(medianFiller, axis=0)
for i in numeric_columns:
print("After filling in medians", i, df[i].mean())
Power 112.78097583380348 Price 9.478339704171498 Mileage_num 18.346527196652747 After filling in medians Power 112.33591021927913 After filling in medians Price 8.825117914770264 After filling in medians Mileage_num 18.344890359950362
Very little difference after filling; the mean of Price moved down about 0.65 Lakh.
# check for duplicated items
df.duplicated(keep=False).value_counts()
False 7249 True 2 dtype: int64
# dropping duplicate rows
df.drop_duplicates(keep=False, inplace=True)
df[df.duplicated(keep=False)] # check drop
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. |
The two duplicate rows have been removed; none remain.
# check the number of unique values in each column of the dataframe
df.nunique().sort_values(ascending=False)
Kilometers_Driven 3660 Name 2039 Price 1373 Mileage_num 437 Power 383 Engine_num 149 Year 23 Location 11 Seats 9 Fuel_Type 4 Owner_Type 4 Transmission 2 dtype: int64
# View the unique items in each row
for i in df.columns:
display(i, pd.DataFrame(df[i].unique()))
'Name'
| 0 | |
|---|---|
| 0 | Maruti Wagon R LXI CNG |
| 1 | Hyundai Creta 1.6 CRDi SX Option |
| 2 | Honda Jazz V |
| 3 | Maruti Ertiga VDI |
| 4 | Audi A4 New 2.0 TDI Multitronic |
| ... | ... |
| 2034 | Skoda Superb Petrol Ambition |
| 2035 | Tata Tiago 1.05 Revotorq XT Option |
| 2036 | Ford EcoSport 1.5 Petrol Ambiente |
| 2037 | Jeep Compass 1.4 Sport |
| 2038 | Hyundai Elite i20 Magna Plus |
2039 rows × 1 columns
'Location'
| 0 | |
|---|---|
| 0 | Mumbai |
| 1 | Pune |
| 2 | Chennai |
| 3 | Coimbatore |
| 4 | Hyderabad |
| 5 | Jaipur |
| 6 | Kochi |
| 7 | Kolkata |
| 8 | Delhi |
| 9 | Bangalore |
| 10 | Ahmedabad |
'Year'
| 0 | |
|---|---|
| 0 | 2010 |
| 1 | 2015 |
| 2 | 2011 |
| 3 | 2012 |
| 4 | 2013 |
| 5 | 2016 |
| 6 | 2018 |
| 7 | 2014 |
| 8 | 2017 |
| 9 | 2007 |
| 10 | 2009 |
| 11 | 2008 |
| 12 | 2019 |
| 13 | 2006 |
| 14 | 2005 |
| 15 | 2004 |
| 16 | 2002 |
| 17 | 2000 |
| 18 | 2003 |
| 19 | 1999 |
| 20 | 2001 |
| 21 | 1998 |
| 22 | 1996 |
'Kilometers_Driven'
| 0 | |
|---|---|
| 0 | 72000 |
| 1 | 41000 |
| 2 | 46000 |
| 3 | 87000 |
| 4 | 40670 |
| ... | ... |
| 3655 | 18242 |
| 3656 | 21190 |
| 3657 | 89411 |
| 3658 | 52262 |
| 3659 | 72443 |
3660 rows × 1 columns
'Fuel_Type'
| 0 | |
|---|---|
| 0 | CNG |
| 1 | Diesel |
| 2 | Petrol |
| 3 | LPG |
'Transmission'
| 0 | |
|---|---|
| 0 | Manual |
| 1 | Automatic |
'Owner_Type'
| 0 | |
|---|---|
| 0 | First |
| 1 | Second |
| 2 | Fourth & Above |
| 3 | Third |
'Power'
| 0 | |
|---|---|
| 0 | 58.16 |
| 1 | 126.20 |
| 2 | 88.70 |
| 3 | 88.76 |
| 4 | 140.80 |
| ... | ... |
| 378 | 147.40 |
| 379 | 328.50 |
| 380 | 98.63 |
| 381 | 360.00 |
| 382 | 241.60 |
383 rows × 1 columns
'Seats'
| 0 | |
|---|---|
| 0 | 5 |
| 1 | 7 |
| 2 | 8 |
| 3 | 4 |
| 4 | 6 |
| 5 | 2 |
| 6 | 10 |
| 7 | 9 |
| 8 | 0 |
'Price'
| 0 | |
|---|---|
| 0 | 1.75 |
| 1 | 12.50 |
| 2 | 4.50 |
| 3 | 6.00 |
| 4 | 17.74 |
| ... | ... |
| 1368 | 1.39 |
| 1369 | 30.54 |
| 1370 | 2.27 |
| 1371 | 17.56 |
| 1372 | 7.43 |
1373 rows × 1 columns
'Engine_num'
| 0 | |
|---|---|
| 0 | 998 |
| 1 | 1582 |
| 2 | 1199 |
| 3 | 1248 |
| 4 | 1968 |
| 5 | 814 |
| 6 | 1461 |
| 7 | 2755 |
| 8 | 1598 |
| 9 | 1462 |
| 10 | 1497 |
| 11 | 2179 |
| 12 | 2477 |
| 13 | 1498 |
| 14 | 2143 |
| 15 | 1995 |
| 16 | 1984 |
| 17 | 1197 |
| 18 | 2494 |
| 19 | 1798 |
| 20 | 2696 |
| 21 | 2698 |
| 22 | 1061 |
| 23 | 1198 |
| 24 | 2987 |
| 25 | 796 |
| 26 | 624 |
| 27 | 1999 |
| 28 | 1991 |
| 29 | 2694 |
| 30 | 1120 |
| 31 | 2498 |
| 32 | 799 |
| 33 | 2393 |
| 34 | 1399 |
| 35 | 1796 |
| 36 | 2148 |
| 37 | 1396 |
| 38 | 1950 |
| 39 | 4806 |
| 40 | 1998 |
| 41 | 1086 |
| 42 | 1193 |
| 43 | 2982 |
| 44 | 1493 |
| 45 | 2967 |
| 46 | 2993 |
| 47 | 1196 |
| 48 | 1799 |
| 49 | 2497 |
| 50 | 2354 |
| 51 | 1373 |
| 52 | 2996 |
| 53 | 1591 |
| 54 | 2894 |
| 55 | 5461 |
| 56 | 1595 |
| 57 | 936 |
| 58 | 1997 |
| 59 | 1896 |
| 60 | 1390 |
| 61 | 1364 |
| 62 | 2199 |
| 63 | 993 |
| 64 | 999 |
| 65 | 1405 |
| 66 | 2956 |
| 67 | 1794 |
| 68 | 995 |
| 69 | 2496 |
| 70 | 1599 |
| 71 | 2400 |
| 72 | 1495 |
| 73 | 2523 |
| 74 | 793 |
| 75 | 4134 |
| 76 | 1596 |
| 77 | 1395 |
| 78 | 2953 |
| 79 | 1586 |
| 80 | 2362 |
| 81 | 1496 |
| 82 | 1368 |
| 83 | 1298 |
| 84 | 1956 |
| 85 | 1299 |
| 86 | 3498 |
| 87 | 2835 |
| 88 | 1150 |
| 89 | 3198 |
| 90 | 1343 |
| 91 | 1499 |
| 92 | 1186 |
| 93 | 1590 |
| 94 | 2609 |
| 95 | 2499 |
| 96 | 2446 |
| 97 | 1978 |
| 98 | 2360 |
| 99 | 3436 |
| 100 | 2198 |
| 101 | 4367 |
| 102 | 2706 |
| 103 | 1422 |
| 104 | 2979 |
| 105 | 1969 |
| 106 | 1489 |
| 107 | 2489 |
| 108 | 1242 |
| 109 | 1388 |
| 110 | 1172 |
| 111 | 2495 |
| 112 | 1194 |
| 113 | 3200 |
| 114 | 1781 |
| 115 | 1341 |
| 116 | 2773 |
| 117 | 3597 |
| 118 | 1985 |
| 119 | 2147 |
| 120 | 1047 |
| 121 | 2999 |
| 122 | 2995 |
| 123 | 2997 |
| 124 | 1948 |
| 125 | 2359 |
| 126 | 4395 |
| 127 | 2349 |
| 128 | 2720 |
| 129 | 1468 |
| 130 | 3197 |
| 131 | 2487 |
| 132 | 1597 |
| 133 | 2771 |
| 134 | 4951 |
| 135 | 970 |
| 136 | 2925 |
| 137 | 2200 |
| 138 | 5000 |
| 139 | 2149 |
| 140 | 5998 |
| 141 | 2092 |
| 142 | 5204 |
| 143 | 2112 |
| 144 | 1797 |
| 145 | 2000 |
| 146 | 1795 |
| 147 | 3696 |
| 148 | 1389 |
'Mileage_num'
| 0 | |
|---|---|
| 0 | 26.60 |
| 1 | 19.67 |
| 2 | 18.20 |
| 3 | 20.77 |
| 4 | 15.20 |
| ... | ... |
| 432 | 11.88 |
| 433 | 12.08 |
| 434 | 15.56 |
| 435 | 14.50 |
| 436 | 17.15 |
437 rows × 1 columns
Observations:
Name has many unique values, as expected. Transmission, Owner_Type, and Fuel_Type have only a few levels each, which will make visual analysis simpler. Since we are using a linear regression model, which assumes normality, I will transform variables with high skewness. As a rule of thumb:
If the skewness is between -0.5 and 0.5, the data are fairly symmetrical.
If the skewness is between -1 and -0.5, or between 0.5 and 1, the data are moderately skewed.
If the skewness is less than -1 or greater than 1, the data are highly skewed.
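The rule of thumb above can be wrapped in a small helper (the band labels are my own naming, not from the assignment):

```python
def skew_band(s: float) -> str:
    """Classify a skewness value using the common rule of thumb."""
    if abs(s) <= 0.5:
        return "fairly symmetrical"
    if abs(s) <= 1:
        return "moderately skewed"
    return "highly skewed"

# example values similar in spirit to the ones computed below
for name, s in {"Mileage_num": 0.21, "Year": -0.84, "Price": 3.73}.items():
    print(name, s, "->", skew_band(s))
```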
# Subset the dataframe based on numeric types
numerical_features = df.select_dtypes(include=["int64", "float64"])
[numerical_features]
[ Year Kilometers_Driven Power Seats Price Engine_num Mileage_num S.No. 0 2010 72000 58.16 5 1.75 998 26.60 1 2015 41000 126.20 5 12.50 1582 19.67 2 2011 46000 88.70 5 4.50 1199 18.20 3 2012 87000 88.76 7 6.00 1248 20.77 4 2013 40670 140.80 5 17.74 1968 15.20 ... ... ... ... ... ... ... ... 7248 2011 89411 103.60 5 5.64 1598 20.54 7249 2015 59000 103.60 5 5.64 1197 17.21 7250 2012 28000 63.10 5 5.64 1461 23.08 7251 2013 52262 103.60 5 5.64 1197 17.20 7252 2014 72443 170.00 5 5.64 2148 10.00 [7249 rows x 7 columns]]
# get skew
numerical_features.skew()
Year -0.84 Kilometers_Driven 61.57 Power 2.00 Seats 1.91 Price 3.73 Engine_num 1.42 Mileage_num 0.21 dtype: float64
# get normality check: std of 1 and mean of 0
display("mean", numerical_features.mean())
display("std", numerical_features.std())
'mean'
Year 2013.37 Kilometers_Driven 58704.14 Power 112.33 Seats 5.28 Price 8.83 Engine_num 1616.01 Mileage_num 18.35 dtype: float64
'std'
Year 3.25 Kilometers_Driven 84450.70 Power 52.93 Seats 0.81 Price 10.30 Engine_num 593.35 Mileage_num 4.14 dtype: float64
cols_to_log = [
"Year",
"Kilometers_Driven",
"Mileage_num",
"Engine_num",
"Power",
"Seats",
]
# print histogram for each column
for colname in cols_to_log:
plt.hist(df[colname], bins=50)
plt.title(colname)
plt.show()
print(np.sum(df[colname] <= 0))  # count values <= 0, where a log transform is undefined
0
0
0
0
0
1
Observations
Year, Engine_num, Power, and Kilometers_Driven have strong skewness. Mileage_num and Seats look fairly normally distributed, though note the single Seats value <= 0 flagged above, which a plain log transform cannot handle. Time to consider other transformations. One option is np.arcsinh, which behaves like the log for large values but also handles zero and negative values.
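To see why np.arcsinh is a convenient log substitute: arcsinh(x) = ln(x + sqrt(x^2 + 1)), which for large x is approximately ln(2x), yet remains defined at zero and for negative values. A quick numeric check:

```python
import numpy as np

x = np.array([0.0, -5.0, 1000.0, 100000.0])
print(np.arcsinh(x))  # defined everywhere; an odd function

# for large positive x, arcsinh(x) is very close to log(2x)
big = np.array([1000.0, 100000.0])
print(np.arcsinh(big) - np.log(2 * big))  # differences near zero
```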
Testing Transformations
high_skew = ["Year", "Kilometers_Driven", "Power", "Engine_num"]
med_skew = ["Mileage_num", "Seats"]
for i in high_skew:
# on higher skew
plt.hist(np.arcsinh(df[i]), 50)
display(i)
plt.title("arcsinh")
plt.show()
for i in med_skew:
# try a regular log transform
plt.hist(np.log(df[i] + 1), 50)
display(i)
plt.title("log")
plt.show()
# on weaker skewness
plt.hist(np.sqrt(df["Seats"]), 50)
plt.title("sqrt")
plt.show()
'Year'
'Kilometers_Driven'
'Power'
'Engine_num'
'Mileage_num'
'Seats'
Observations
The transformations compress the scale of the skewed variables, bringing their distributions closer to symmetric.
Price¶# look at the distribution
sns.histplot(df.Price, kde=True).set_title("Distribution of Price")  # sns.distplot is deprecated
Text(0.5, 1.0, 'Distribution of Price')
# create probability plot
stats.probplot(df.Price, dist="norm", plot=pylab)
plt.show()
Observations
Price does not follow a normal distribution. Decision: what would it look like if we transformed Price to a log scale?
x_bar, sigma = norm.fit(df.Price)
print("\n mean = {:2f} and std-dev = {:2f}\n".format(x_bar, sigma))
mean = 8.825997 and std-dev = 10.295087
# plot the log-transformed distribution
sns.histplot(np.log(df.Price), kde=True)
plt.legend(
[r"Normal Dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f})".format(x_bar, sigma)],
loc="best",
)
plt.ylabel("Frequency")
plt.title("Distribution of log(Price)")
plt.show()
# QQ plot
fig = plt.figure()
res = stats.probplot(df.Price, plot=plt)
plt.show()
Observations:
The log transformation compresses the scale of the distribution, even with the huge range of Price. The outliers leave the log-transformed distribution still a bit skewed, but it is much closer to normal than the original.
# Log Transformation has definitely helped in reducing the skew
# Creating a new column with the transformed variable.
df["price_log"] = np.log(df["Price"])
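The reduction in skew can also be checked numerically. A sketch on a synthetic right-skewed sample (a lognormal, so the figures are illustrative, not the notebook's):

```python
import numpy as np
import pandas as pd

# a lognormal sample is strongly right-skewed, much like Price
rng = np.random.default_rng(0)
price = pd.Series(rng.lognormal(mean=2.0, sigma=0.9, size=5000))

print("raw skew:", round(price.skew(), 2))          # well above 1 (highly skewed)
print("log skew:", round(np.log(price).skew(), 2))  # near 0 (fairly symmetrical)
```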
Notes from GreatLearning:
"An outlier is a data point that is distant from other similar points.
Outliers in the data can distort predictions and affect the accuracy so it's important to flag them for review. This is especially the case with regression models.
The challenge with outlier detection is determining if a point is truly a problem or simply a large value. If a point is genuine then it is very important to keep it in the data as otherwise we're removing the most interesting data points. Regardless, it is essential to understand their impact on our predictive models and statistics, and the extent to which a small number of outlying points are dominating the fit of the model (for example, the mean is much more sensitive to outliers than the median). It is left to the best judgement of the investigator to decide whether treating outliers is necessary and how to go about it. Knowledge of domain and impact of the business problem tend to drive this decision. "
Notes from GreatLearning:
"The z-transformation used earlier can flag a point as being far away from the mean. If the data are normally distributed then we expect the vast majority of points to be within 3 standard deviations of the mean, which corresponds to a z score with an absolute value of at most 3.
If the data are not normal, however, the situation is more complicated."
# get a sub-df of independent variables to transform
trans = df[["Year", "Kilometers_Driven", "Mileage_num", "Engine_num", "Power", "Seats"]]
def z_transform(x):
    return (x - np.mean(x)) / np.std(x)
"""
for i in list(numerical_features):
plt.hist(z_transform(df[i].values))
plt.title('z-transformed normal data')
plt.show()
"""
for column_name, column_data in trans.items():  # iteritems() was removed in pandas 2.0; use items()
    plt.hist(z_transform(column_data))
    plt.title("z-transformed normal data")
    print(column_name)
    plt.show()
Year
Kilometers_Driven
Mileage_num
Engine_num
Power
Seats
numerical_features = df.select_dtypes(include=["int64", "float64"])
# numerical_features
A z-transform only rescales each column, so a large z-score alone doesn't make a point bad data in itself.
Every column tested has points flagged as outliers by this method.
But depending on the distribution in question, almost all z-scores may fall between -3 and 3, or there may instead be extremely large values.
# set parameters for -3 to +3 cut-off
upper = numerical_features.mean() + 3 * numerical_features.std()
lower = numerical_features.mean() - 3 * numerical_features.std()
# keep values within +/- 3 standard deviations of the mean
# (non-numeric columns have no bounds here, so they show as NaN below)
out_df = df[(numerical_features < upper) & (numerical_features > lower)]
display(out_df)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 0 | NaN | NaN | 2010.00 | 72000.00 | NaN | NaN | NaN | 58.16 | 5.00 | 1.75 | 998.00 | 26.60 | 0.56 |
| 1 | NaN | NaN | 2015.00 | 41000.00 | NaN | NaN | NaN | 126.20 | 5.00 | 12.50 | 1582.00 | 19.67 | 2.53 |
| 2 | NaN | NaN | 2011.00 | 46000.00 | NaN | NaN | NaN | 88.70 | 5.00 | 4.50 | 1199.00 | 18.20 | 1.50 |
| 3 | NaN | NaN | 2012.00 | 87000.00 | NaN | NaN | NaN | 88.76 | 7.00 | 6.00 | 1248.00 | 20.77 | 1.79 |
| 4 | NaN | NaN | 2013.00 | 40670.00 | NaN | NaN | NaN | 140.80 | 5.00 | 17.74 | 1968.00 | 15.20 | 2.88 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7248 | NaN | NaN | 2011.00 | 89411.00 | NaN | NaN | NaN | 103.60 | 5.00 | 5.64 | 1598.00 | 20.54 | 1.73 |
| 7249 | NaN | NaN | 2015.00 | 59000.00 | NaN | NaN | NaN | 103.60 | 5.00 | 5.64 | 1197.00 | 17.21 | 1.73 |
| 7250 | NaN | NaN | 2012.00 | 28000.00 | NaN | NaN | NaN | 63.10 | 5.00 | 5.64 | 1461.00 | 23.08 | 1.73 |
| 7251 | NaN | NaN | 2013.00 | 52262.00 | NaN | NaN | NaN | 103.60 | 5.00 | 5.64 | 1197.00 | 17.20 | 1.73 |
| 7252 | NaN | NaN | 2014.00 | 72443.00 | NaN | NaN | NaN | 170.00 | 5.00 | 5.64 | 2148.00 | 10.00 | 1.73 |
7249 rows × 13 columns
Now let's create a new column to do a z-score test with. Note that this loop overwrites the column on each pass, so only the z-scores for the last numerical feature (price_log) survive.
for i in numerical_features:
    df["zscore"] = (df[i] - df[i].mean()) / df[i].std()
df.head(4)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | zscore | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||
| 0 | Maruti Wagon R LXI CNG | Mumbai | 2010 | 72000 | CNG | Manual | First | 58.16 | 5 | 1.75 | 998 | 26.60 | 0.56 | -1.57 |
| 1 | Hyundai Creta 1.6 CRDi SX Option | Pune | 2015 | 41000 | Diesel | Manual | First | 126.20 | 5 | 12.50 | 1582 | 19.67 | 2.53 | 0.90 |
| 2 | Honda Jazz V | Chennai | 2011 | 46000 | Petrol | Manual | First | 88.70 | 5 | 4.50 | 1199 | 18.20 | 1.50 | -0.38 |
| 3 | Maruti Ertiga VDI | Chennai | 2012 | 87000 | Diesel | Manual | First | 88.76 | 7 | 6.00 | 1248 | 20.77 | 1.79 | -0.02 |
Now let's look for rows that would need to be cut using this method.
display(df[df["zscore"] > 3].shape)
display(df[df["zscore"] < -3].shape)
(31, 14)
(12, 14)
We would lose a total of 43 rows (31 high, 12 low) using this method, and only for the last column scored. I'd rather go column by column, so I will drop this column.
df.drop("zscore", inplace=True, axis=1)
# Function from GreatLearning
"""
Another way to flag points as outliers is to compute the IQR,
which is the interval going from the 1st quartile to the 3rd quartile of the data in question,
and then flag a point for investigation if it is outside 1.5 * IQR.
"""
def frac_outside_1pt5_IQR(x):
    length = 1.5 * np.diff(np.quantile(x, [0.25, 0.75]))
    return np.mean(np.abs(x - np.median(x)) > length)
for column_name, column_data in numerical_features.items():
    print(column_name, frac_outside_1pt5_IQR(column_data.to_numpy()) * 100)
Year 3.4487515519381984
Kilometers_Driven 6.387087874189544
Power 9.049524072285832
Seats 15.90564215753897
Price 16.73334253000414
Engine_num 7.366533314939992
Mileage_num 2.552076148434267
price_log 15.422816940267623
Seats and Price each have over 15% of the sample flagged as outliers by this method.
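Note that frac_outside_1pt5_IQR flags points where |x − median| > 1.5·IQR, a median-centered variant; the textbook Tukey rule instead fences at Q1 − 1.5·IQR and Q3 + 1.5·IQR. A small sketch on toy data contrasting the two (they agree here, but can differ on skewed data):

```python
import numpy as np

def tukey_outliers(x, k=1.5):
    """Classic Tukey rule: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)

def median_iqr_outliers(x, k=1.5):
    """Median-centered variant used above: |x - median| > k*IQR."""
    q1, q3 = np.quantile(x, [0.25, 0.75])
    return np.abs(x - np.median(x)) > k * (q3 - q1)

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(x[tukey_outliers(x)])       # [100]
print(x[median_iqr_outliers(x)])  # [100]
```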
for column_name, column_data in numerical_features.items():
    quartiles = np.quantile(df[column_name][df[column_name].notnull()], [0.25, 0.75])
    column_name_4iqr = 4 * (quartiles[1] - quartiles[0])
    print(
        "\n",
        column_name,
        f"Q1 = {quartiles[0]}, Q3 = {quartiles[1]}, 4*IQR = {column_name_4iqr}",
    )
    # values whose distance from the column median exceeds 4*IQR
    outlier_column_name = df.loc[
        np.abs(df[column_name] - df[column_name].median()) > column_name_4iqr,
        column_name,
    ]
    # check len and values for each
    print("\n", outlier_column_name)
Year Q1 = 2011.0, Q3 = 2016.0, 4*IQR = 20.0
Series([], Name: Year, dtype: int64)
Kilometers_Driven Q1 = 34000.0, Q3 = 73000.0, 4*IQR = 156000.0
S.No.
29 262000
77 230000
181 216000
340 775000
358 620000
838 215000
1068 225000
1101 248000
1442 231673
1528 299322
1766 220000
1846 255000
1860 720000
1975 282000
2267 215750
2328 6500000
2346 234000
2823 480000
2988 240000
3092 480000
3649 300000
3713 232000
3894 250000
4491 445000
4719 250000
4792 242000
5647 227000
5800 210000
5834 250000
6919 290000
6921 350000
Name: Kilometers_Driven, dtype: int64
Power Q1 = 77.0, Q3 = 138.03, 4*IQR = 244.12
S.No.
70 500.00
134 362.07
148 444.00
152 362.90
418 367.00
589 364.90
1930 364.90
2095 362.07
2100 550.00
2978 394.30
3132 340.00
3341 402.00
4061 444.00
4342 364.90
4451 395.00
4627 450.00
4691 421.00
4722 387.30
4821 450.00
5088 503.00
5341 382.00
5521 552.00
5603 394.30
5781 560.00
5919 488.10
6186 364.90
6354 616.00
6960 362.07
7057 450.00
7117 360.00
Name: Power, dtype: float64
Seats Q1 = 5.0, Q3 = 5.0, 4*IQR = 0.0
S.No.
3 7
7 8
15 7
29 7
32 7
..
7175 7
7194 4
7198 7
7206 7
7223 7
Name: Seats, Length: 1153, dtype: int64
Price Q1 = 3.85, Q3 = 8.4, 4*IQR = 18.200000000000003
S.No.
13 27.00
19 28.00
38 28.00
62 26.70
67 35.67
...
5927 45.52
5946 48.00
5970 26.76
5996 30.54
6008 45.00
Name: Price, Length: 553, dtype: float64
Engine_num Q1 = 1198.0, Q3 = 1968.0, 4*IQR = 3080.0
S.No.
70 4806
152 5461
2100 4806
2978 4806
4451 4951
4691 5461
4722 5461
5088 5000
5521 5998
5603 4806
5781 5204
5919 5000
6354 5998
Name: Engine_num, dtype: int64
Mileage_num Q1 = 15.4, Q3 = 21.1, 4*IQR = 22.800000000000004
Series([], Name: Mileage_num, dtype: float64)
price_log Q1 = 1.3480731482996928, Q3 = 2.128231705849268, 4*IQR = 3.1206342301983003
S.No.
4079 5.08
Name: price_log, dtype: float64
Observations
Year and Mileage_num return empty series, so they have no values outside 4*IQR.
Kilometers_Driven, Power, and Engine_num have relatively few flagged values; we could simply drop them or replace them with NaN.
Price and Seats have many values flagged in this range and need treatment other than dropping or replacing them all with NaN.
for column_name, column_data in numerical_features.items():
    plt.hist(column_data, 20)
    plt.title("Histogram")
    display(column_name)
    plt.show()
    sns.boxplot(column_data)
    plt.title("Boxplot")
    display(column_name)
    plt.show()
(histogram and boxplot pairs displayed for each of: Year, Kilometers_Driven, Power, Seats, Price, Engine_num, Mileage_num, price_log)
Kilometers_Driven, Power, and Price "have a lot of values that are flagged as suspicious by the boxplot, but in the histogram we can see that the distribution is skewed so these points aren't inconsistent with the overall distribution of the data. Nevertheless, having a heavy tail means we might want to consider statistics less sensitive to large values, so e.g. the median may be a better measure of central tendency."
# let's plot the boxplots of all columns to check for outliers
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Let's check outliers against the target variable and see where they are. We see Price, Kilometers_Driven, and Power have heavy outliers, and some exist in every column.
# plot independent variables against the dependent variable
for i, variable in enumerate(numerical_features):
    sns.scatterplot(x=df[variable], y=df.Price)
    plt.show()
Observations
Year, Seats, and Mileage have outliers in the upper and lower left regions.
Kilometers_Driven is different with outliers on either end.
There are many upper-end outliers in Power, Price, and Engine.
"We will treat these outliers as these might adversely affect the predictive power of linear model. However, in real life, these outliers may be due to non-linear pattern in the data or can be important information. Sometimes outliers in the independent variable can adversely impact the linear model. This can be checked by building the model with and without outliers and comparing the model performances."
Decision:
For the sake of trying to make the data fit a normal distribution more easily, I will go ahead and drop outliers from these key columns.
Notes from GreatLearning:
"If we decide that we do actually have some problematic outliers, we have a few options.
If the point seems truly nonsensical it may be best to treat it as missing
Alternatively, we could drop that observation or we could use statistics that are robust to outliers
Could replace with the max value from the column or a string value
It's often a good idea to examine the sensitivity to outliers by running an analysis with and without them. If they are because of the types of data then don't drop. But statistically, it is better to cut them off. "
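One option from the list above, using statistics robust to outliers, can also take the form of capping (winsorizing) extreme values instead of dropping rows. A minimal sketch on hypothetical kilometre readings, assuming a cap of 300,000 km (an illustrative bound, not one derived from the data):

```python
import pandas as pd

# hypothetical Kilometers_Driven values with two implausible entries
km = pd.Series([41000, 72000, 87000, 46000, 775000, 6500000])

# cap at an assumed plausibility bound instead of dropping the rows
km_capped = km.clip(upper=300_000)

print(km_capped.max())  # 300000
print(len(km_capped))   # all 6 rows retained
```

Capping preserves the sample size, at the cost of distorting the tail; dropping does the reverse.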
Let's examine some of the maximum values that stand out as outliers. They could be data entry errors.
Extreme values in Kilometers_Driven.
# box plot
sns.boxplot(df.Kilometers_Driven)
<AxesSubplot:xlabel='Kilometers_Driven'>
# view the highest values
df[~(df["Kilometers_Driven"] < 750000)]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 340 | Skoda Octavia Ambition Plus 2.0 TDI AT | Kolkata | 2013 | 775000 | Diesel | Automatic | First | 141.00 | 5 | 7.50 | 1968 | 19.30 | 2.01 |
| 2328 | BMW X5 xDrive 30d M Sport | Chennai | 2017 | 6500000 | Diesel | Automatic | First | 258.00 | 5 | 65.00 | 2993 | 15.97 | 4.17 |
6,500,000 km on a 2017 car doesn't make sense, so we'll drop that row.
# drop the high values by row
df.drop(labels=[2328], axis=0, inplace=True)
Check the drop.
# view the high values again
df[~(df["Kilometers_Driven"] < 750000)]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 340 | Skoda Octavia Ambition Plus 2.0 TDI AT | Kolkata | 2013 | 775000 | Diesel | Automatic | First | 141.00 | 5 | 7.50 | 1968 | 19.30 | 2.01 |
Fixing extreme values in Seats
df.sort_values(by=["Seats"], ascending=True)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 3999 | Audi A4 3.2 FSI Tiptronic Quattro | Hyderabad | 2012 | 125000 | Petrol | Automatic | First | 94.34 | 0 | 18.00 | 3197 | 10.50 | 2.89 |
| 5781 | Lamborghini Gallardo Coupe | Delhi | 2011 | 6500 | Petrol | Automatic | Third | 560.00 | 2 | 120.00 | 5204 | 6.40 | 4.79 |
| 4722 | Mercedes-Benz SL-Class SL 500 | Kolkata | 2010 | 35000 | Petrol | Automatic | First | 387.30 | 2 | 29.50 | 5461 | 8.10 | 3.38 |
| 6842 | Nissan 370Z AT | Kolkata | 2012 | 14850 | Petrol | Automatic | First | 328.50 | 2 | 5.64 | 3696 | 10.00 | 1.73 |
| 926 | Porsche Cayman 2009-2012 S | Hyderabad | 2010 | 10000 | Petrol | Manual | First | 94.34 | 2 | 40.00 | 3436 | 9.00 | 3.69 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1907 | Toyota Qualis FS B3 | Bangalore | 2002 | 63000 | Diesel | Manual | Third | 75.00 | 10 | 3.65 | 2446 | 13.10 | 1.29 |
| 2267 | Toyota Qualis RS E2 | Pune | 2004 | 215750 | Diesel | Manual | Second | 94.34 | 10 | 3.50 | 2446 | 18.20 | 1.25 |
| 2575 | Chevrolet Tavera LS B3 10 Seats BSIII | Hyderabad | 2015 | 120000 | Diesel | Manual | First | 80.00 | 10 | 5.50 | 2499 | 14.80 | 1.70 |
| 6242 | Tata Sumo EX 10/7 Str BSII | Chennai | 2015 | 196000 | Diesel | Manual | Second | 68.00 | 10 | 5.64 | 1948 | 12.20 | 1.73 |
| 917 | Tata Sumo DX | Mumbai | 2013 | 20000 | Diesel | Manual | First | 83.10 | 10 | 5.25 | 1978 | 14.07 | 1.66 |
7248 rows × 13 columns
Here we will drop the 0 value in Seats which is likely a data-entry error.
# drop the row where Seats = 0
df.drop(labels=3999, axis=0, inplace=True)
# sort df by column
df.sort_values(by=["Seats"], ascending=True)
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 5781 | Lamborghini Gallardo Coupe | Delhi | 2011 | 6500 | Petrol | Automatic | Third | 560.00 | 2 | 120.00 | 5204 | 6.40 | 4.79 |
| 6960 | Mercedes-Benz SLC 43 AMG | Coimbatore | 2018 | 18338 | Petrol | Automatic | First | 362.07 | 2 | 5.64 | 2996 | 19.00 | 1.73 |
| 1078 | Porsche Boxster S tiptronic | Kolkata | 2015 | 10512 | Petrol | Automatic | First | 265.00 | 2 | 64.00 | 2706 | 8.60 | 4.16 |
| 2305 | Porsche Cayman 2009-2012 S tiptronic | Mumbai | 2011 | 8000 | Petrol | Automatic | First | 94.34 | 2 | 43.00 | 3436 | 9.00 | 3.76 |
| 5919 | Jaguar F Type 5.0 V8 S | Hyderabad | 2015 | 8000 | Petrol | Automatic | First | 488.10 | 2 | 100.00 | 5000 | 12.50 | 4.61 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6875 | Toyota Qualis FS B3 | Pune | 2002 | 119613 | Diesel | Manual | Second | 75.00 | 10 | 5.64 | 2446 | 13.10 | 1.73 |
| 814 | Toyota Qualis FS B2 | Pune | 2004 | 77757 | Diesel | Manual | Second | 75.00 | 10 | 3.50 | 2446 | 13.10 | 1.25 |
| 1907 | Toyota Qualis FS B3 | Bangalore | 2002 | 63000 | Diesel | Manual | Third | 75.00 | 10 | 3.65 | 2446 | 13.10 | 1.29 |
| 6288 | Chevrolet Tavera LS B3 10 Seats BSIII | Hyderabad | 2005 | 150000 | Diesel | Manual | Second | 80.00 | 10 | 5.64 | 2499 | 14.80 | 1.73 |
| 2575 | Chevrolet Tavera LS B3 10 Seats BSIII | Hyderabad | 2015 | 120000 | Diesel | Manual | First | 80.00 | 10 | 5.50 | 2499 | 14.80 | 1.70 |
7247 rows × 13 columns
# confirm no rows remain where Seats = 0
df[df.Seats == 0.00]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. |
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
Price
Decision:
Without having more information about the target market for this business, it's hard to decide whether to remove the highest-priced used cars, those over 150 Lakh. I imagine in many cases such cars are purchased new rather than used. I am only losing a handful of records, which seems reasonable. For the sake of the modeling we are doing here and the need to normalize the data, I will drop them.
# view the rows to be dropped
df[df.Price >= 150.00]
| Name | Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 4079 | Land Rover Range Rover 3.0 Diesel LWB Vogue | Hyderabad | 2017 | 25000 | Diesel | Automatic | First | 255.00 | 5 | 160.00 | 2993 | 13.33 | 5.08 |
# drop rows
df.drop(df[df.Price >= 150.00].index, inplace=True)
# let's look at box plot to see if outliers have been treated or not
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numerical_features):
    plt.subplot(5, 4, i + 1)
    plt.boxplot(df[variable], whis=1.5)
    plt.tight_layout()
    plt.title(variable)
plt.show()
# plot independent variables against the dependent variable
for i, variable in enumerate(numerical_features):
    sns.scatterplot(x=df[variable], y=df.Price)
    plt.show()
# get the size of the current df vs the original
display(len(df))
display("retained", (len(df) / 7253) * 100, "% of original data")
7246
'retained'
99.90348821177444
'% of original data'
After treatment we lost only 7 of our original 7,253 records.
Observations
Names
As a key feature against our target variable (Price), Name needs to be split into two separate variables to account for the inherent data structure (Brand + Model).
df["Name"].nunique()
2037
I will try to pull out the Make vs the Model to make it simpler.
# split the Name column on the first space into Brand and Model
df[["Brand", "Model"]] = df.Name.str.split(" ", n=1, expand=True)
# convert to categorical
df["Brand"] = pd.Categorical(df.Brand)
df["Model"] = pd.Categorical(df.Model)
df[["Brand", "Model"]].dtypes
Brand category Model category dtype: object
# value counts
df["Brand"].value_counts()
Maruti 1444 Hyundai 1340 Honda 741 Toyota 506 Mercedes-Benz 380 Volkswagen 374 Ford 351 Mahindra 330 BMW 311 Audi 284 Tata 228 Skoda 202 Renault 170 Chevrolet 151 Nissan 117 Land 66 Jaguar 48 Fiat 38 Mitsubishi 36 Mini 31 Volvo 28 Jeep 19 Porsche 19 Datsun 17 ISUZU 3 Force 3 Isuzu 2 Bentley 2 Ambassador 1 Smart 1 OpelCorsa 1 Hindustan 1 Lamborghini 1 Name: Brand, dtype: int64
# plot
plt.figure(figsize=(15, 7))
sns.countplot(y="Brand", data=df, order=df["Brand"].value_counts().index)
<AxesSubplot:xlabel='count', ylabel='Brand'>
After separating out the Name I can safely drop it.
# drop a column
df.drop("Name", inplace=True, axis=1)
Let's review our new columns
Brand
Let's check out Brand by Price to find breakdowns of car classes. Here we look at mean price by brand.
# find unique count
df.Brand.nunique()
33
Keep only the first word of Model.
# keep only the first word of the Model string
df["Model"] = df.Model.str.split(" ", n=1, expand=True)[0]
# find unique count
df.Model.nunique()
217
# Check the data
df["Model"].value_counts()
Swift 418
City 316
i20 303
Innova 203
Verna 200
...
Countryman 1
Beetle 1
CLS-Class 1
Boxster 1
1000 1
Name: Model, Length: 217, dtype: int64
plt.figure(figsize=(15, 7))
sns.countplot(y="Model", data=df, order=df["Model"].value_counts().index[:30])
<AxesSubplot:xlabel='count', ylabel='Model'>
Since there are so many models, let's see if we can reduce the load on the later visualizations and modeling with a more generalized grouping.
217 unique values is better than over 2,000 but is still a lot. For the sake of a better model, let's drop this level of specificity and focus on Brand. We will create categories based on price.
# View mean of one variable by another
df.groupby(["Brand"])["Price"].mean().sort_values(ascending=False)
Brand Lamborghini 120.00 Porsche 46.10 Land 33.86 Bentley 32.32 Jaguar 32.30 Mini 23.47 Mercedes-Benz 23.36 BMW 22.28 Audi 22.13 Jeep 15.97 Volvo 15.51 Isuzu 12.82 Toyota 10.45 ISUZU 9.91 Mitsubishi 9.70 Force 9.33 Mahindra 7.60 Skoda 7.28 Ford 6.71 Renault 5.78 OpelCorsa 5.64 Hindustan 5.64 Honda 5.45 Hyundai 5.40 Volkswagen 5.36 Nissan 4.94 Maruti 4.70 Tata 3.95 Fiat 3.89 Datsun 3.66 Chevrolet 3.56 Smart 3.00 Ambassador 1.35 Name: Price, dtype: float64
Observations
# normalize Brand to title case (this merges duplicates like ISUZU/Isuzu)
df.Brand = df.Brand.str.title()
# view the cleaned brand names
df.Brand.unique()
array(['Maruti', 'Hyundai', 'Honda', 'Audi', 'Nissan', 'Toyota',
'Volkswagen', 'Tata', 'Land', 'Mitsubishi', 'Renault',
'Mercedes-Benz', 'Bmw', 'Mahindra', 'Ford', 'Porsche', 'Datsun',
'Jaguar', 'Volvo', 'Chevrolet', 'Skoda', 'Mini', 'Fiat', 'Jeep',
'Smart', 'Ambassador', 'Isuzu', 'Force', 'Bentley', 'Lamborghini',
'Hindustan', 'Opelcorsa'], dtype=object)
# get lowest and highest car values
df.loc[(df.Brand == "Ambassador") | (df.Brand == "Lamborghini")]
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | Brand | Model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||
| 1221 | Chennai | 2003 | 80000 | Diesel | Manual | Third | 35.50 | 5 | 1.35 | 1489 | 12.80 | 0.30 | Ambassador | Classic |
| 5781 | Delhi | 2011 | 6500 | Petrol | Automatic | Third | 560.00 | 2 | 120.00 | 5204 | 6.40 | 4.79 | Lamborghini | Gallardo |
Brand
We will bucket the Brands (Makes) into economic tiers.
Let's bin by the generalized breakdown apparent in the Price column.
df["Car_category"] = pd.cut(df.Price, bins=[0, 15, 30, 50, 200])
df.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | Brand | Model | Car_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 58.16 | 5 | 1.75 | 998 | 26.60 | 0.56 | Maruti | Wagon | (0, 15] |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 126.20 | 5 | 12.50 | 1582 | 19.67 | 2.53 | Hyundai | Creta | (0, 15] |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 88.70 | 5 | 4.50 | 1199 | 18.20 | 1.50 | Honda | Jazz | (0, 15] |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 88.76 | 7 | 6.00 | 1248 | 20.77 | 1.79 | Maruti | Ertiga | (0, 15] |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 140.80 | 5 | 17.74 | 1968 | 15.20 | 2.88 | Audi | A4 | (15, 30] |
# value counts
df["Car_category"].value_counts().sort_index()
(0, 15] 6265 (15, 30] 634 (30, 50] 259 (50, 200] 88 Name: Car_category, dtype: int64
It's clear that the majority of the used cars fall between 0 and 15 Lakhs. Let's label the bins with some generic car terminology.
# drop column
df.drop(columns="Car_category", inplace=True)
# make cut with labels (note: the top edge of 100 leaves the one price above it as NaN)
df["Car_category"] = pd.cut(
    df.Price,
    bins=[0, 15, 30, 50, 100],
    labels=["Budget_Friendly", "Mid_Range", "Luxury_Cars", "Ultra_luxury"],
)
df.head()
| Location | Year | Kilometers_Driven | Fuel_Type | Transmission | Owner_Type | Power | Seats | Price | Engine_num | Mileage_num | price_log | Brand | Model | Car_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||
| 0 | Mumbai | 2010 | 72000 | CNG | Manual | First | 58.16 | 5 | 1.75 | 998 | 26.60 | 0.56 | Maruti | Wagon | Budget_Friendly |
| 1 | Pune | 2015 | 41000 | Diesel | Manual | First | 126.20 | 5 | 12.50 | 1582 | 19.67 | 2.53 | Hyundai | Creta | Budget_Friendly |
| 2 | Chennai | 2011 | 46000 | Petrol | Manual | First | 88.70 | 5 | 4.50 | 1199 | 18.20 | 1.50 | Honda | Jazz | Budget_Friendly |
| 3 | Chennai | 2012 | 87000 | Diesel | Manual | First | 88.76 | 7 | 6.00 | 1248 | 20.77 | 1.79 | Maruti | Ertiga | Budget_Friendly |
| 4 | Coimbatore | 2013 | 40670 | Diesel | Automatic | Second | 140.80 | 5 | 17.74 | 1968 | 15.20 | 2.88 | Audi | A4 | Mid_Range |
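Worth noting: pd.cut builds right-closed bins by default and returns NaN for values outside the outermost edges, which is why the single price above the top edge of 100 ends up uncategorized. A standalone sketch of that behavior:

```python
import pandas as pd

# hypothetical prices spanning all four bins plus one out-of-range value
prices = pd.Series([1.75, 17.74, 45.0, 65.0, 120.0])
cats = pd.cut(
    prices,
    bins=[0, 15, 30, 50, 100],
    labels=["Budget_Friendly", "Mid_Range", "Luxury_Cars", "Ultra_luxury"],
)
print(cats.tolist())      # 120.0 falls outside the top edge and becomes NaN
print(cats.isna().sum())  # 1
```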
I will need to one-hot-encode these categories for modeling. For now let's make sure they are categorical types.
cat_vars = ["Car_category", "Model", "Brand"]
for colname in cat_vars:
    df[colname] = df[colname].astype("category")
df.info()  # explore dataframe information, check dtypes, and nulls
<class 'pandas.core.frame.DataFrame'> Int64Index: 7246 entries, 0 to 7252 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Location 7246 non-null category 1 Year 7246 non-null int64 2 Kilometers_Driven 7246 non-null int64 3 Fuel_Type 7246 non-null category 4 Transmission 7246 non-null category 5 Owner_Type 7246 non-null category 6 Power 7246 non-null float64 7 Seats 7246 non-null int64 8 Price 7246 non-null float64 9 Engine_num 7246 non-null int64 10 Mileage_num 7246 non-null float64 11 price_log 7246 non-null float64 12 Brand 7246 non-null category 13 Model 7246 non-null category 14 Car_category 7245 non-null category dtypes: category(7), float64(4), int64(4) memory usage: 836.3 KB
# get value counts
df.Car_category.value_counts()
Budget_Friendly 6265 Mid_Range 634 Luxury_Cars 259 Ultra_luxury 87 Name: Car_category, dtype: int64
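For the one-hot encoding mentioned above, pd.get_dummies is the usual route. A minimal sketch on a hypothetical mini-frame; drop_first=True drops one level per variable to avoid the dummy-variable trap in linear regression:

```python
import pandas as pd

# hypothetical mini-frame standing in for two of the categorical columns
mini = pd.DataFrame(
    {
        "Transmission": ["Manual", "Automatic", "Manual"],
        "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    }
)

# one dummy column per remaining level; the first (alphabetical) level is dropped
encoded = pd.get_dummies(mini, columns=["Transmission", "Fuel_Type"], drop_first=True)
print(encoded.columns.tolist())
# ['Transmission_Manual', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol']
```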
# summary (quantile) statistics for numerical variables
df.describe()
| Year | Kilometers_Driven | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|
| count | 7246.00 | 7246.00 | 7246.00 | 7246.00 | 7246.00 | 7246.00 | 7246.00 | 7246.00 |
| mean | 2013.37 | 57810.70 | 112.30 | 5.28 | 8.80 | 1615.41 | 18.35 | 1.81 |
| std | 3.25 | 37504.19 | 52.88 | 0.81 | 10.12 | 592.75 | 4.13 | 0.80 |
| min | 1996.00 | 171.00 | 34.20 | 2.00 | 0.44 | 624.00 | 6.40 | -0.82 |
| 25% | 2011.00 | 34000.00 | 77.00 | 5.00 | 3.85 | 1198.00 | 15.40 | 1.35 |
| 50% | 2014.00 | 53469.50 | 94.34 | 5.00 | 5.64 | 1493.00 | 18.20 | 1.73 |
| 75% | 2016.00 | 73000.00 | 138.03 | 5.00 | 8.40 | 1968.00 | 21.10 | 2.13 |
| max | 2019.00 | 775000.00 | 616.00 | 10.00 | 120.00 | 5998.00 | 33.54 | 4.79 |
# Basic statistics for Categorical variables
df.describe(include=["category"])
| Location | Fuel_Type | Transmission | Owner_Type | Brand | Model | Car_category | |
|---|---|---|---|---|---|---|---|
| count | 7246 | 7246 | 7246 | 7246 | 7246 | 7246 | 7245 |
| unique | 11 | 4 | 2 | 4 | 32 | 217 | 4 |
| top | Mumbai | Diesel | Manual | First | Maruti | Swift | Budget_Friendly |
| freq | 946 | 3850 | 5202 | 5945 | 1444 | 418 | 6265 |
# Function from GreatLearning: annotate bar graphs with percentages
def per_on_bar(ax, feature):
    total = len(feature)
    for p in ax.patches:
        perc = "{:.1f}%".format(100 * p.get_height() / total)
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        ax.annotate(perc, (x, y), size=12)
    plt.show()
categoricals = df.select_dtypes(include=["object", "category", "string"])
categoricals
Location Fuel_Type Transmission Owner_Type Brand Model \
S.No.
0 Mumbai CNG Manual First Maruti Wagon
1 Pune Diesel Manual First Hyundai Creta
2 Chennai Petrol Manual First Honda Jazz
3 Chennai Diesel Manual First Maruti Ertiga
4 Coimbatore Diesel Automatic Second Audi A4
... ... ... ... ... ... ...
7248 Hyderabad Diesel Manual First Volkswagen Vento
7249 Mumbai Petrol Automatic First Volkswagen Polo
7250 Kolkata Diesel Manual First Nissan Micra
7251 Pune Petrol Automatic Third Volkswagen Polo
7252 Kochi Diesel Automatic First Mercedes-Benz E-Class
Car_category
S.No.
0 Budget_Friendly
1 Budget_Friendly
2 Budget_Friendly
3 Budget_Friendly
4 Mid_Range
... ...
7248 Budget_Friendly
7249 Budget_Friendly
7250 Budget_Friendly
7251 Budget_Friendly
7252 Budget_Friendly
[7246 rows x 7 columns]
# Function to print summary statistics for the numeric features
pd.set_option("display.max_colwidth", 5000)
def univar_num(frame):
    # Mode is taken over the full df (most frequent value per column)
    print("\n Mode Analysis \n", df.mode("index")[0:2])
    print("\n Sum Analysis \n", frame.sum())  # sum analysis
    print("\n Variance analysis \n", frame.var())  # variance analysis
    # mean absolute deviation (Series.mad was removed in pandas 2.0)
    print(
        "\n Absolute Deviation or Mean Absolute Deviation \n",
        (frame - frame.mean()).abs().mean(),
    )
    print("\n Skew analysis \n ", frame.skew())  # skew analysis
# Run dataframe through function
univar_num(numerical_features)
Mode Analysis
Location Year Kilometers_Driven Fuel_Type Transmission Owner_Type Power \
0 Mumbai 2015 60000 Diesel Manual First 74.00
Seats Price Engine_num Mileage_num price_log Brand Model \
0 5 5.64 1197 17.00 1.73 Maruti Swift
Car_category
0 Budget_Friendly
Sum Analysis
Year 14594892.00
Kilometers_Driven 425546305.00
Power 814311.69
Seats 38259.00
Price 63979.65
Engine_num 11714459.00
Mileage_num 132984.80
price_log 13111.35
dtype: float64
Variance analysis
Year 10.59
Kilometers_Driven 7131921139.95
Power 2801.12
Seats 0.65
Price 106.00
Engine_num 352069.92
Mileage_num 17.10
price_log 0.64
dtype: float64
Absolute Deviation or Mean Absolute Deviation
Year 2.58
Kilometers_Driven 26624.55
Power 39.54
Seats 0.53
Price 6.22
Engine_num 460.64
Mileage_num 3.31
price_log 0.57
dtype: float64
Skew analysis
Year -0.84
Kilometers_Driven 61.57
Power 2.00
Seats 1.91
Price 3.73
Engine_num 1.42
Mileage_num 0.21
price_log 0.52
dtype: float64
# Frequency tables for every feature
for column_name, column_data in df.items():  # iteritems() was removed in pandas 2.0
    display((column_name, column_data.value_counts()))
('Location',
Mumbai 946
Hyderabad 874
Coimbatore 772
Kochi 772
Pune 765
Delhi 660
Kolkata 654
Chennai 589
Jaipur 499
Bangalore 440
Ahmedabad 275
Name: Location, dtype: int64)
('Year',
2015 929
2014 925
2016 885
2013 791
2017 707
2012 689
2011 578
2010 405
2018 361
2009 252
2008 207
2007 148
2019 119
2006 89
2005 68
2004 35
2003 20
2002 18
2001 8
2000 5
1998 4
1999 2
1996 1
Name: Year, dtype: int64)
('Kilometers_Driven',
60000 96
45000 86
65000 86
70000 77
50000 71
..
24350 1
58217 1
60268 1
21363 1
64917 1
Name: Kilometers_Driven, Length: 3659, dtype: int64)
('Fuel_Type',
Diesel 3850
Petrol 3322
CNG 62
LPG 12
Name: Fuel_Type, dtype: int64)
('Transmission',
Manual 5202
Automatic 2044
Name: Transmission, dtype: int64)
('Owner_Type',
First 5945
Second 1152
Third 137
Fourth & Above 12
Name: Owner_Type, dtype: int64)
('Power',
74.00 280
94.34 174
98.60 166
73.90 152
140.00 142
...
328.50 1
174.57 1
560.00 1
231.10 1
83.11 1
Name: Power, Length: 383, dtype: int64)
('Seats',
5 6094
7 796
8 170
4 119
6 38
2 18
10 8
9 3
Name: Seats, dtype: int64)
('Price',
5.64 1238
4.50 88
5.50 84
3.50 82
4.25 73
...
14.29 1
21.26 1
55.70 1
43.74 1
2.22 1
Name: Price, Length: 1372, dtype: int64)
('Engine_num',
1197 732
1248 610
1498 370
998 309
1198 281
2179 278
1497 271
1968 266
1995 212
1461 188
2143 184
1199 177
1582 174
1396 170
1598 164
796 164
2494 154
1086 129
1591 113
1399 106
2993 104
2982 103
1493 103
1798 99
2987 78
814 74
2967 67
1120 60
1373 59
1364 58
1196 57
2354 53
1298 48
2755 42
1991 41
799 40
1896 39
1799 39
999 36
1998 35
1496 33
1061 32
624 31
1796 29
2393 28
1586 26
2498 26
1499 25
936 25
1405 25
1794 24
1193 24
1984 22
2523 22
1997 22
1999 21
2148 21
1495 20
1186 18
2477 17
2199 17
1956 17
1368 17
2499 16
995 15
3198 15
2497 14
993 14
2696 14
3498 13
2400 13
1595 13
1599 13
2835 12
1341 11
1388 11
1299 11
2953 10
1950 9
2198 8
1150 8
4134 8
2489 8
1596 7
2996 7
2979 7
1462 7
1172 6
1590 6
2698 6
2496 6
1047 5
2446 5
793 5
4367 5
2359 5
2609 5
1390 4
4395 4
4806 4
2362 4
1343 4
1797 4
1194 3
2349 3
2956 3
5461 3
1395 3
1969 3
1948 2
5998 2
3597 2
2360 2
2894 2
2200 2
2997 2
2495 2
2771 2
1242 2
2487 2
5000 2
1985 2
2999 2
1781 2
1597 2
2147 2
3436 2
1468 1
1978 1
4951 1
2149 1
1795 1
2694 1
2773 1
2925 1
2995 1
3200 1
1389 1
2112 1
5204 1
2706 1
3696 1
2720 1
1489 1
1422 1
2000 1
2092 1
970 1
Name: Engine_num, dtype: int64)
('Mileage_num',
17.00 206
18.90 201
18.60 144
18.20 115
21.10 107
...
16.12 1
12.97 1
22.80 1
11.62 1
7.50 1
Name: Mileage_num, Length: 437, dtype: int64)
('price_log',
1.73 1238
1.50 88
1.70 84
1.25 82
1.45 73
...
0.63 1
1.43 1
4.17 1
1.38 1
4.53 1
Name: price_log, Length: 1372, dtype: int64)
('Brand',
Maruti 1444
Hyundai 1340
Honda 741
Toyota 506
Mercedes-Benz 380
Volkswagen 374
Ford 351
Mahindra 330
Bmw 311
Audi 284
Tata 228
Skoda 202
Renault 170
Chevrolet 151
Nissan 117
Land 66
Jaguar 48
Fiat 38
Mitsubishi 36
Mini 31
Volvo 28
Jeep 19
Porsche 19
Datsun 17
Isuzu 5
Force 3
Bentley 2
Lamborghini 1
Opelcorsa 1
Hindustan 1
Smart 1
Ambassador 1
Name: Brand, dtype: int64)
('Model',
Swift 418
City 316
i20 303
Innova 203
Verna 200
...
Fusion 1
Venture 1
Gallardo 1
Land 1
Motors 1
Name: Model, Length: 217, dtype: int64)
('Car_category',
Budget_Friendly 6265
Mid_Range 634
Luxury_Cars 259
Ultra_luxury 87
Name: Car_category, dtype: int64)
Observations:
The top brand is Maruti and the top model is Swift.
2015 is the most common model year among the used cars.
Most cars are listed by their first owner.
# Function to create distribution and statistical plots for each feature in the dataframe
def univar_vis(dataframe):
    sns.set(style="ticks")
    sns.set_style("darkgrid")
    # Univariate visualization of categorical variables
    for column_name_c, column_data_c in categoricals.items():
        print(column_name_c, "Count Plot")
        plt.figure(figsize=(20, 10))
        sns.countplot(x=column_data_c)  # bar plot of category counts
        plt.show()
        # sns.displot(data=column_data_c)  # alternative: distribution plot
    # Univariate visualization of numeric variables
    for column_name, column_data in numerical_features.items():
        print(column_name, "Histogram")
        plt.figure(figsize=(20, 10))  # create the figure before plotting so figsize takes effect
        sns.histplot(df[column_name], kde=True)  # histogram with density curve
        plt.show()
        print(column_name, "Density Plot")
        plt.figure(figsize=(20, 10))
        sns.kdeplot(x=df[column_name])  # kernel density estimate
        plt.show()
        print(column_name, "Box Plot")
        plt.figure(figsize=(20, 10))
        sns.boxplot(y=df[column_name])  # box plot
        plt.show()
    return
# Run dataframe through function
univar_vis(df)
[Output: count plots for Location, Fuel_Type, Transmission, Owner_Type, Brand, Model, and Car_category, followed by a histogram, density plot, and box plot for each of Year, Kilometers_Driven, Power, Seats, Price, Engine_num, Mileage_num, and price_log.]
# From DSBA_SLF_MLS1_LifeExpectancyCaseStudy
all_col = df.select_dtypes(include=["float64", "int64"]).columns.tolist()
plt.figure(figsize=(17, 75))
for i in range(len(all_col)):
    plt.subplot(18, 3, i + 1)
    plt.hist(df[all_col[i]])
    # sns.histplot(df[all_col[i]], kde=True)  # swap with plt.hist above to get distribution curves
    plt.tight_layout()
    plt.title(all_col[i], fontsize=25)
plt.show()
### Code from Mounika of GreatLearning as given in the Uber Case study ###
# While doing uni-variate analysis of numerical variables we want to study their central tendency
# and dispersion.
# Let us write a function that will help us create boxplot and histogram for any input numerical
# variable.
# This function takes the numerical column as the input and returns the boxplots
# and histograms for the variable.
# Let us see if this helps us write faster and cleaner code.
def histogram_boxplot(feature, figsize=(15, 10), bins=None):
    """Boxplot and histogram combined
    feature: 1-d feature array
    figsize: size of fig (default (15, 10))
    bins: number of bins (default None / auto)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # number of rows of the subplot grid = 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        x=feature, ax=ax_box2, showmeans=True, color="red"
    )  # boxplot; a marker will indicate the mean value of the column
    if bins:
        sns.histplot(feature, kde=False, ax=ax_hist2, bins=bins)  # histogram (histplot replaces the deprecated distplot)
    else:
        sns.histplot(feature, kde=False, ax=ax_hist2)
    ax_hist2.axvline(np.mean(feature), color="g", linestyle="--")  # add mean to the histogram
    ax_hist2.axvline(np.median(feature), color="black", linestyle="-")  # add median to the histogram

display(numerical_features.apply(lambda x: histogram_boxplot(x)))
[Output: a combined boxplot + histogram is displayed for each of Year, Kilometers_Driven, Power, Seats, Price, Engine_num, Mileage_num, and price_log; the apply call itself returns a Series of None values.]
# Covariance analysis
df.cov()
| Year | Kilometers_Driven | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|
| Year | 10.59 | -54876.52 | 4.16 | 0.03 | 9.15 | -102.53 | 3.98 | 1.20 |
| Kilometers_Driven | -54876.52 | 1406564098.90 | -2580.42 | 6435.57 | -58965.28 | 3509610.72 | -22085.82 | -5261.39 |
| Power | 4.16 | -2580.42 | 2796.50 | 4.18 | 379.00 | 26799.62 | -118.73 | 29.28 |
| Seats | 0.03 | 6435.57 | 4.18 | 0.65 | 0.41 | 193.12 | -1.16 | 0.10 |
| Price | 9.15 | -58965.28 | 379.00 | 0.41 | 102.44 | 3637.38 | -12.61 | 6.87 |
| Engine_num | -102.53 | 3509610.72 | 26799.62 | 193.12 | 3637.38 | 351346.92 | -1564.60 | 297.06 |
| Mileage_num | 3.98 | -22085.82 | -118.73 | -1.16 | -12.61 | -1564.60 | 17.09 | -0.90 |
| price_log | 1.20 | -5261.39 | 29.28 | 0.10 | 6.87 | 297.06 | -0.90 | 0.63 |
# Correlation Analysis
corr = df.corr()
display(corr)
| Year | Kilometers_Driven | Power | Seats | Price | Engine_num | Mileage_num | price_log | |
|---|---|---|---|---|---|---|---|---|
| Year | 1.00 | -0.45 | 0.02 | 0.01 | 0.28 | -0.05 | 0.30 | 0.46 |
| Kilometers_Driven | -0.45 | 1.00 | -0.00 | 0.21 | -0.16 | 0.16 | -0.14 | -0.18 |
| Power | 0.02 | -0.00 | 1.00 | 0.10 | 0.71 | 0.85 | -0.54 | 0.70 |
| Seats | 0.01 | 0.21 | 0.10 | 1.00 | 0.05 | 0.40 | -0.35 | 0.15 |
| Price | 0.28 | -0.16 | 0.71 | 0.05 | 1.00 | 0.61 | -0.30 | 0.85 |
| Engine_num | -0.05 | 0.16 | 0.85 | 0.40 | 0.61 | 1.00 | -0.64 | 0.63 |
| Mileage_num | 0.30 | -0.14 | -0.54 | -0.35 | -0.30 | -0.64 | 1.00 | -0.27 |
| price_log | 0.46 | -0.18 | 0.70 | 0.15 | 0.85 | 0.63 | -0.27 | 1.00 |
We are looking for correlations close to one; anything above 0.8 in absolute value is of interest.
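Rather than scanning the correlation table by eye, the interesting cells can be pulled out programmatically. A small sketch (`strong_pairs` is a hypothetical helper, not part of the notebook):

```python
import numpy as np
import pandas as pd

def strong_pairs(corr, thresh=0.8):
    """Return variable pairs whose absolute correlation exceeds thresh.

    Keeps only the upper triangle (k=1) so each pair appears once and
    the diagonal of 1.0s is excluded.
    """
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()  # MultiIndex Series of (row, column) -> correlation
    return pairs[pairs.abs() > thresh].sort_values(key=abs, ascending=False)
```

On the correlation table above, this would surface Power/Engine_num and Price/price_log (both 0.85).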
# Plot the heatmap of correlations
plt.figure(figsize=(16, 12))
sns.heatmap(
    corr,
    annot=True,
    cmap="coolwarm",
    fmt=".1f",
    xticklabels=corr.columns,
    yticklabels=corr.columns,
)
<AxesSubplot:>
Observations:
Power and Engine are highly positively correlated (0.85); one of them will be dropped for the linear regression model.
Power and Price are next in line (0.71).
Price vs. Engine (0.61) and Mileage vs. Engine (-0.64) are third in correlation strength. The rest of the data is not strongly correlated.
# write Pandas Profile report
profile = ProfileReport(df, title="Pandas Profiling Report")
# create iframe version
profile.to_notebook_iframe()
Price
Identify the key variables that have a strong relationship with the dependent variable.
Note: the dependent variable (price_log) is on a log scale at this point.
Transmission and Price to get a best guess of linear relationship
# create a lineplot
plt.figure(figsize=(20, 10))
sns.lineplot(y="Price", x="Transmission", data=df, ci=None)
<AxesSubplot:xlabel='Transmission', ylabel='Price'>
Engine and Price to get a best guess of linear relationship
# create a lineplot
plt.figure(figsize=(20, 15))
sns.lineplot(y="Price", x="Engine_num", data=df, ci=None)
<AxesSubplot:xlabel='Engine_num', ylabel='Price'>
Power and Price to get a best guess of linear relationship
# create a lineplot
plt.figure(figsize=(20, 15))
sns.lineplot(y="Price", x="Power", data=df, ci=None)
<AxesSubplot:xlabel='Power', ylabel='Price'>
Temporal variables are used case by case, and knowing how and when to apply them comes with time and practice. It is always up to the data scientist to decide which variables to use; the process is somewhat subjective.
# average Price over the years
plt.figure(figsize=(15, 7))
sns.lineplot(x="Year", y="Price", data=df, ci=None)
<AxesSubplot:xlabel='Year', ylabel='Price'>
Observations
Transmission has a negative linear relationship with Price when moving from Automatic to Manual.
Engine & Power appear to have positive linear relationships with Price.
Price seems to increase over time.
# box plot of Price by Location
sns.catplot(
data=df, x=df["Location"], y=df["Price"], kind="box", orient="v", height=5, aspect=2
)
<seaborn.axisgrid.FacetGrid at 0x20ed6bea3a0>
Here are the correlations for looking into multicollinearity.
# Get correlation of each variable against price
df[df.columns[:]].corr().sort_values("Price", ascending=False)["Price"][:]
Price                 1.00
price_log             0.85
Power                 0.71
Engine_num            0.61
Year                  0.28
Seats                 0.05
Kilometers_Driven    -0.16
Mileage_num          -0.30
Name: Price, dtype: float64
# GreatLearning code
numeric_columns = df.select_dtypes(include=np.number).columns.tolist()
numeric_columns.remove("Year") # dropping year column as it is temporal variable
corr = (
df[numeric_columns].corr().sort_values(by=["Price"], ascending=False)
) # sorting correlations w.r.t Price
# Set up the matplotlib figure
f, ax = plt.subplots(figsize=(28, 15))
# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(
    corr,
    mask=np.triu(corr),  # mask the upper triangle; a mask excluding small values is possible but not easy to work with
    cmap="seismic",
    annot=True,
    fmt=".1f",
    vmin=-1,
    vmax=1,
    center=0,
    square=False,
    linewidths=0.7,
    cbar_kws={"shrink": 0.5},
)
<AxesSubplot:>
Observations
Engine and Power are strongly correlated, which may be why they sit at the top.
Next, we plot Price as the dependent variable against every numeric variable, colored by the categorical variables.
# Cross the numeric variables with Price and see the spread across the other categorical variables
for column_name_c, column_data_c in numerical_features.items():
    for column_name, column_data in categoricals.items():
        sns.histplot(
            x=column_data_c,
            y=df.Price,
            data=df,
            hue=column_name,
            multiple="stack",
            shrink=0.8,
        )
        # plt.figure(figsize=(20, 20))
        plt.show()
Key meaningful observations on the relationship between variables:
S.No. :
Name : split into Make and Model to account for wrangling issues in visualization; will be useful to encode for modeling.
Location :
Year :
Kilometers_Driven :
Fuel_Type :
Transmission :
Owner :
Mileage : negatively correlated with Power.
Engine : strongly correlated with Power, which is to be expected; also negatively correlated with Mileage.
Power :
Seats :
New_Price :
Price : the price of the used car in INR Lakhs (1 Lakh = 100,000 INR); most strongly correlated with Power. The data is heavily right-skewed.
Brand, Model and Car_category : have an obvious positive correlation with Price.
Our main insights were:
1. The central tendency (mean, or another estimator) of used-car `Price` is much higher in Coimbatore and Bangalore than in other locations.
2. `Price` for Petrol vehicles has many outliers well above the mean.
3. Used cars with a third owner tend to be from before 2015 and have lower prices than cars with fewer owners.
Encoding Categoricals
Notes from GreatLearning:
"Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not.
Sometimes we want to one-hot encode where every level gets an indicator, but other times we want to drop one level to account for the fact that in a model we will have an intercept."
This is worth doing at the very end because for many earlier analysis tasks (e.g. visualization and value counts) it will be more convenient to keep categorical variables as a single column. Additionally, we'll want to make sure missing values are resolved by this point.
ind_vars = df.select_dtypes(include=["object", "category"]).columns.tolist()
ind_vars
['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Brand', 'Model', 'Car_category']
With 217 models to sort through, I am not going to take the effort of encoding them all and will drop that column for our model. We still have Brand and the Car_category bins to help us make sense of the specific cars.
df.drop("Model", axis=1, inplace=True)
# get the size of the current df vs the original
(len(df) / 7253) * 100
99.90348821177444
# generate binary values using get_dummies, dropping the first level of each categorical
def encode_cat_vars(x):
    x = pd.get_dummies(
        x,
        columns=x.select_dtypes(include=["object", "category"]).columns.tolist(),
        drop_first=True,
    )
    return x
# apply function
ind_vars_num = encode_cat_vars(df)
ind_vars_num.head()
| Year | Kilometers_Driven | Power | Seats | Price | Engine_num | Mileage_num | price_log | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Brand_Audi | Brand_Bentley | Brand_Bmw | Brand_Chevrolet | Brand_Datsun | Brand_Fiat | Brand_Force | Brand_Ford | Brand_Hindustan | Brand_Honda | Brand_Hyundai | Brand_Isuzu | Brand_Jaguar | Brand_Jeep | Brand_Lamborghini | Brand_Land | Brand_Mahindra | Brand_Maruti | Brand_Mercedes-Benz | Brand_Mini | Brand_Mitsubishi | Brand_Nissan | Brand_Opelcorsa | Brand_Porsche | Brand_Renault | Brand_Skoda | Brand_Smart | Brand_Tata | Brand_Toyota | Brand_Volkswagen | Brand_Volvo | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 0 | 2010 | 72000 | 58.16 | 5 | 1.75 | 998 | 26.60 | 0.56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2015 | 41000 | 126.20 | 5 | 12.50 | 1582 | 19.67 | 2.53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011 | 46000 | 88.70 | 5 | 4.50 | 1199 | 18.20 | 1.50 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2012 | 87000 | 88.76 | 7 | 6.00 | 1248 | 20.77 | 1.79 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2013 | 40670 | 140.80 | 5 | 17.74 | 1968 | 15.20 | 2.88 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
# check the df shape
ind_vars_num.shape
(7246, 59)
# get the size of the current df vs the original
(len(ind_vars_num) / 7253) * 100
99.90348821177444
# rebind df to the encoded dataframe
df = ind_vars_num
# get the size of the current df vs the original
(len(df) / 7253) * 100
99.90348821177444
We have ~60 columns to model with. We will probably cut this down considerably. We still have the majority of our data intact.
Split the data into train and test sets (the training set can be further split into train and validation, since we report only on the test set, never on training).
At this point I will drop Price to look at price_log against each dependent variable.
# drop a column
df.drop("Price", axis=1, inplace=True)
# let's build our linear model
# independent variables: everything except the target price_log
X = df.drop(["price_log"], axis=1)
# the dependent variable = price_log
y = df[["price_log"]]
display(y.head())
display(y.head())
| price_log | |
|---|---|
| S.No. | |
| 0 | 0.56 |
| 1 | 2.53 |
| 2 | 1.50 |
| 3 | 1.79 |
| 4 | 2.88 |
print(X.shape)
print(y.shape)
(7246, 57) (7246, 1)
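The first rows of the tables above suggest `price_log` is the natural log of `Price` (ln 1.75 ≈ 0.56, ln 12.50 ≈ 2.53), so predictions on the log scale can be mapped back to lakhs with `np.exp`. A quick sanity check (`to_lakhs` is a hypothetical helper, not from the notebook):

```python
import numpy as np

# price_log appears to be ln(Price); check against rows 0 and 1 of the tables above
assert abs(np.log(1.75) - 0.56) < 0.01
assert abs(np.log(12.50) - 2.53) < 0.01

def to_lakhs(pred_log):
    """Map a price_log prediction back to lakhs (assumes natural log)."""
    return np.exp(pred_log)
```

Keeping this back-transform in mind matters when reporting error metrics, which are otherwise in log units.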
# Split X and y into training and test sets in a 70:30 ratio
# pseudo-randomness via a fixed random seed makes the sample reproducible
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1
)
# print the size of each split
print("Number of rows in train data =", x_train.shape[0])
print("Number of rows in test data =", x_test.shape[0])
Number of rows in train data = 5072 Number of rows in test data = 2174
# check the head
x_train.head()
| Year | Kilometers_Driven | Power | Seats | Engine_num | Mileage_num | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Brand_Audi | Brand_Bentley | Brand_Bmw | Brand_Chevrolet | Brand_Datsun | Brand_Fiat | Brand_Force | Brand_Ford | Brand_Hindustan | Brand_Honda | Brand_Hyundai | Brand_Isuzu | Brand_Jaguar | Brand_Jeep | Brand_Lamborghini | Brand_Land | Brand_Mahindra | Brand_Maruti | Brand_Mercedes-Benz | Brand_Mini | Brand_Mitsubishi | Brand_Nissan | Brand_Opelcorsa | Brand_Porsche | Brand_Renault | Brand_Skoda | Brand_Smart | Brand_Tata | Brand_Toyota | Brand_Volkswagen | Brand_Volvo | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 2080 | 2014 | 71000 | 126.32 | 5 | 1582 | 19.08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 62 | 2015 | 58000 | 186.00 | 5 | 1796 | 11.74 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6978 | 2008 | 131000 | 105.00 | 5 | 1896 | 15.00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3502 | 2015 | 54339 | 88.73 | 5 | 1396 | 22.54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2156 | 2014 | 105044 | 100.60 | 7 | 2494 | 12.99 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
# Best practice: keep code minimal - you can change the model an object holds and the rest still runs smoothly
linearregression = LinearRegression()  # build a regression model from scikit-learn
# fitting the linear regression model on the train data (70% of the whole data)
linearregression.fit(x_train, y_train)
LinearRegression()
# retrieve the intercept of the fitted model
intercept = linearregression.intercept_[0]
print("The intercept for our model is {}".format(intercept))
The intercept for our model is -181.12333731112986
# collect coefficients and intercept in a dataframe
coef_df = pd.DataFrame(
    np.append(linearregression.coef_[0], linearregression.intercept_[0]),
    index=x_train.columns.tolist() + ["Intercept"],
    columns=["Coefficients"],
)
coef_df
| Coefficients | |
|---|---|
| Year | 0.09 |
| Kilometers_Driven | -0.00 |
| Power | 0.00 |
| Seats | 0.01 |
| Engine_num | 0.00 |
| Mileage_num | -0.01 |
| Location_Bangalore | 0.07 |
| Location_Chennai | 0.00 |
| Location_Coimbatore | 0.05 |
| Location_Delhi | -0.05 |
| Location_Hyderabad | 0.06 |
| Location_Jaipur | -0.08 |
| Location_Kochi | -0.04 |
| Location_Kolkata | -0.14 |
| Location_Mumbai | 0.00 |
| Location_Pune | -0.04 |
| Fuel_Type_Diesel | 0.27 |
| Fuel_Type_LPG | 0.15 |
| Fuel_Type_Petrol | 0.07 |
| Transmission_Manual | -0.08 |
| Owner_Type_Fourth & Above | 0.10 |
| Owner_Type_Second | -0.05 |
| Owner_Type_Third | -0.15 |
| Brand_Audi | 0.41 |
| Brand_Bentley | 0.22 |
| Brand_Bmw | 0.49 |
| Brand_Chevrolet | 0.00 |
| Brand_Datsun | -0.03 |
| Brand_Fiat | 0.11 |
| Brand_Force | 0.59 |
| Brand_Ford | 0.19 |
| Brand_Hindustan | 0.00 |
| Brand_Honda | 0.34 |
| Brand_Hyundai | 0.27 |
| Brand_Isuzu | -0.20 |
| Brand_Jaguar | 0.33 |
| Brand_Jeep | 0.14 |
| Brand_Lamborghini | 0.00 |
| Brand_Land | 0.57 |
| Brand_Mahindra | 0.20 |
| Brand_Maruti | 0.26 |
| Brand_Mercedes-Benz | 0.44 |
| Brand_Mini | 0.60 |
| Brand_Mitsubishi | 0.31 |
| Brand_Nissan | 0.27 |
| Brand_Opelcorsa | 1.20 |
| Brand_Porsche | 0.26 |
| Brand_Renault | 0.27 |
| Brand_Skoda | 0.31 |
| Brand_Smart | 0.30 |
| Brand_Tata | -0.16 |
| Brand_Toyota | 0.35 |
| Brand_Volkswagen | 0.25 |
| Brand_Volvo | 0.24 |
| Car_category_Mid_Range | 0.70 |
| Car_category_Luxury_Cars | 0.98 |
| Car_category_Ultra_luxury | 1.19 |
| Intercept | -181.12 |
We will now perform linear regression using statsmodels, a Python module that provides functions for the estimation of many statistical models, as well as for conducting statistical tests, and statistical data exploration.
Using statsmodels, we will be able to check the statistical validity of our model.
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train = sm.add_constant(x_train)
# Add constant to test data
x_test = sm.add_constant(x_test)
def build_ols_model(train):
    # Create the model
    olsmodel = sm.OLS(y_train["price_log"], train)
    return olsmodel.fit()
olsmodel = build_ols_model(x_train)
print(olsmodel.summary())
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.825
Model: OLS Adj. R-squared: 0.823
Method: Least Squares F-statistic: 429.1
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 06:55:51 Log-Likelihood: -1622.4
No. Observations: 5072 AIC: 3357.
Df Residuals: 5016 BIC: 3723.
Df Model: 55
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -181.1233 4.129 -43.867 0.000 -189.218 -173.029
Year 0.0905 0.002 44.027 0.000 0.086 0.094
Kilometers_Driven -1.278e-07 1.56e-07 -0.819 0.413 -4.34e-07 1.78e-07
Power 0.0022 0.000 8.408 0.000 0.002 0.003
Seats 0.0079 0.009 0.846 0.397 -0.010 0.026
Engine_num 0.0002 2.5e-05 6.985 0.000 0.000 0.000
Mileage_num -0.0142 0.002 -6.406 0.000 -0.019 -0.010
Location_Bangalore 0.0686 0.031 2.188 0.029 0.007 0.130
Location_Chennai 0.0016 0.029 0.055 0.956 -0.056 0.059
Location_Coimbatore 0.0521 0.029 1.822 0.068 -0.004 0.108
Location_Delhi -0.0456 0.029 -1.580 0.114 -0.102 0.011
Location_Hyderabad 0.0641 0.028 2.303 0.021 0.010 0.119
Location_Jaipur -0.0785 0.030 -2.586 0.010 -0.138 -0.019
Location_Kochi -0.0412 0.029 -1.443 0.149 -0.097 0.015
Location_Kolkata -0.1416 0.029 -4.873 0.000 -0.199 -0.085
Location_Mumbai 0.0017 0.028 0.060 0.952 -0.053 0.056
Location_Pune -0.0396 0.028 -1.393 0.164 -0.095 0.016
Fuel_Type_Diesel 0.2668 0.051 5.269 0.000 0.168 0.366
Fuel_Type_LPG 0.1452 0.137 1.061 0.289 -0.123 0.413
Fuel_Type_Petrol 0.0691 0.052 1.338 0.181 -0.032 0.170
Transmission_Manual -0.0779 0.016 -4.910 0.000 -0.109 -0.047
Owner_Type_Fourth & Above 0.1026 0.113 0.909 0.363 -0.119 0.324
Owner_Type_Second -0.0519 0.014 -3.709 0.000 -0.079 -0.024
Owner_Type_Third -0.1546 0.035 -4.425 0.000 -0.223 -0.086
Brand_Audi 0.4112 0.339 1.212 0.226 -0.254 1.076
Brand_Bentley 0.2232 0.485 0.460 0.646 -0.728 1.175
Brand_Bmw 0.4911 0.340 1.446 0.148 -0.175 1.157
Brand_Chevrolet 0.0012 0.339 0.004 0.997 -0.664 0.666
Brand_Datsun -0.0279 0.350 -0.080 0.936 -0.714 0.658
Brand_Fiat 0.1097 0.344 0.319 0.750 -0.564 0.784
Brand_Force 0.5887 0.413 1.426 0.154 -0.221 1.398
Brand_Ford 0.1873 0.338 0.554 0.580 -0.476 0.850
Brand_Hindustan 3.501e-13 1.8e-13 1.948 0.051 -2.25e-15 7.03e-13
Brand_Honda 0.3377 0.338 0.998 0.318 -0.326 1.001
Brand_Hyundai 0.2732 0.338 0.808 0.419 -0.389 0.936
Brand_Isuzu -0.1953 0.378 -0.517 0.605 -0.935 0.545
Brand_Jaguar 0.3332 0.344 0.967 0.333 -0.342 1.008
Brand_Jeep 0.1390 0.350 0.397 0.691 -0.548 0.826
Brand_Lamborghini -1.246e-13 6.59e-14 -1.891 0.059 -2.54e-13 4.6e-15
Brand_Land 0.5720 0.342 1.675 0.094 -0.098 1.242
Brand_Mahindra 0.1977 0.339 0.584 0.559 -0.466 0.862
Brand_Maruti 0.2595 0.338 0.768 0.443 -0.403 0.922
Brand_Mercedes-Benz 0.4446 0.339 1.312 0.190 -0.220 1.109
Brand_Mini 0.5984 0.346 1.728 0.084 -0.080 1.277
Brand_Mitsubishi 0.3121 0.343 0.910 0.363 -0.360 0.984
Brand_Nissan 0.2680 0.340 0.789 0.430 -0.398 0.934
Brand_Opelcorsa 1.1978 0.476 2.517 0.012 0.265 2.131
Brand_Porsche 0.2640 0.351 0.751 0.453 -0.425 0.953
Brand_Renault 0.2650 0.339 0.781 0.435 -0.400 0.930
Brand_Skoda 0.3118 0.339 0.920 0.358 -0.353 0.976
Brand_Smart 0.2993 0.477 0.627 0.530 -0.636 1.234
Brand_Tata -0.1567 0.339 -0.463 0.644 -0.821 0.507
Brand_Toyota 0.3455 0.338 1.021 0.307 -0.318 1.009
Brand_Volkswagen 0.2517 0.338 0.744 0.457 -0.411 0.915
Brand_Volvo 0.2360 0.347 0.680 0.497 -0.444 0.916
Car_category_Mid_Range 0.7006 0.022 32.076 0.000 0.658 0.743
Car_category_Luxury_Cars 0.9806 0.031 31.305 0.000 0.919 1.042
Car_category_Ultra_luxury 1.1911 0.051 23.558 0.000 1.092 1.290
==============================================================================
Omnibus: 192.923 Durbin-Watson: 2.012
Prob(Omnibus): 0.000 Jarque-Bera (JB): 554.732
Skew: -0.100 Prob(JB): 3.48e-121
Kurtosis: 4.608 Cond. No. 1.20e+16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 1.71e-19. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
Notes from GreatLearning:
If all the independent variable coefficients (Year, Brand, Engine, Seats and so on) are zero, then the expected output (i.e., Y) would be equal to the const coefficient.
Negative coefficient values show that Price decreases as the corresponding attribute increases. These include Kilometers_Driven, Mileage_num, and others.
Positive coefficient values show that Price increases as the corresponding attribute increases. These include Year, Power, and others.
p-value of a variable indicates if the variable is significant or not. If we consider the significance level to be 0.05 (5%), then any variable with a p-value less than 0.05 would be considered significant.
For each independent feature, there is a null hypothesis and an alternate hypothesis. Here $\beta_i$ is the coefficient of the $i$th independent variable.
(P>|t|) gives the p-value for each independent feature to check that null hypothesis. We are considering 0.05 (5%) as significance level.
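Spelled out, the test on each coefficient uses the notation above:

$H_0: \beta_i = 0$ (the $i$th feature has no linear effect on price_log) versus $H_a: \beta_i \neq 0$.

A p-value (P>|t|) below 0.05 leads us to reject $H_0$ for that feature and treat it as significant.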
Observations:
We have built a linear regression model that fits well (adjusted R-squared of about 0.82) but shows multicollinearity.
But this is not our final model. We have to check the statistical validity of the model, and also make sure it satisfies the assumptions of linear regression.
Evaluate the model on different performance metrics and comment on the performance and scope of improvement
Let's check the performance of the model using different metrics (MAE, MAPE, RMSE, $R^2$).
# defining functions for error metrics, from GreatLearning
import math

# RMSE
def rmse(predictions, targets):
    return np.sqrt(((targets - predictions) ** 2).mean())

# MAPE
def mape(predictions, targets):
    return np.mean(np.abs((targets - predictions)) / targets) * 100

# MAE
def mae(predictions, targets):
    return np.mean(np.abs((targets - predictions)))
# Model performance on train and test data
def model_pref(olsmodel, x_train, x_test):
    # In-sample prediction
    y_pred_train_Price = olsmodel.predict(x_train)
    y_train_Price = y_train["price_log"]
    # Prediction on test data
    y_pred_test_Price = olsmodel.predict(x_test)
    y_test_Price = y_test["price_log"]
    print(
        pd.DataFrame(
            {
                "Data": ["Train", "Test"],
                "RMSE": [
                    rmse(y_pred_train_Price, y_train_Price),
                    rmse(y_pred_test_Price, y_test_Price),
                ],
                "MAE": [
                    mae(y_pred_train_Price, y_train_Price),
                    mae(y_pred_test_Price, y_test_Price),
                ],
                "MAPE": [
                    mape(y_pred_train_Price, y_train_Price),
                    mape(y_pred_test_Price, y_test_Price),
                ],
            }
        )
    )

# Checking model performance
model_pref(olsmodel, x_train, x_test)
    Data  RMSE   MAE  MAPE
0  Train  0.33  0.25   inf
1   Test  0.33  0.25   inf
Observations
Root Mean Squared Error on the train and test data is nearly identical, indicating that our model is not overfitting.
Mean Absolute Error indicates that our current model predicts used-car prices within a mean error of 0.25 on the log(price) scale on test data.
Mean Absolute Percentage Error is inf because the denominator is 0 wherever price_log is 0.
The units of RMSE and MAE are the same - log-lakhs in this case. RMSE is greater than MAE because it penalises outliers more.
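Since `price_log` is the natural log of `Price`, any car priced at exactly 1 lakh has a target of 0 and the `mape` function above divides by zero. A masked variant sidesteps this (a sketch; `safe_mape` and `eps` are my names, not the notebook's):

```python
import numpy as np

def safe_mape(predictions, targets, eps=1e-8):
    """MAPE that skips targets too close to zero.

    Hypothetical helper: the notebook's mape() returns inf because
    price_log == 0 whenever Price == 1 lakh (ln 1 == 0).
    """
    predictions = np.asarray(predictions, dtype=float)
    targets = np.asarray(targets, dtype=float)
    mask = np.abs(targets) > eps  # drop near-zero targets before dividing
    return np.mean(np.abs((targets[mask] - predictions[mask]) / targets[mask])) * 100
```

A cleaner option still is to compute MAPE on the back-transformed prices (np.exp of the predictions), where the targets are never zero.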
We need to deal with multicollinearity and check the other assumptions of linear regression first, and then look at the p-values.
Perform tests for the assumptions of linear regression. We will check the following assumptions to prevent errors in inference or insights:
1. No multicollinearity (two predictors too tightly correlated with each other, which inflates the coefficient estimates)
2. Mean of residuals should be 0
3. No heteroscedasticity
4. Linearity of variables
5. Normality of error terms (important for valid inference on the coefficients)
def checking_vif(train):
    vif = pd.DataFrame()
    vif["feature"] = train.columns
    # calculating VIF for each feature
    vif["VIF"] = [
        variance_inflation_factor(train.values, i) for i in range(len(train.columns))
    ]
    return vif

# print(checking_vif(x_train))
Features with a VIF score > 5 will be dropped or treated until all remaining features have a VIF score < 5.
# Check VIF
checking_vif(x_train)
| | feature | VIF |
|---|---|---|
| 0 | const | 770317.64 |
| 1 | Year | 2.04 |
| 2 | Kilometers_Driven | 1.62 |
| 3 | Power | 8.78 |
| 4 | Seats | 2.53 |
| 5 | Engine_num | 10.05 |
| 6 | Mileage_num | 3.89 |
| 7 | Location_Bangalore | 2.42 |
| 8 | Location_Chennai | 3.03 |
| 9 | Location_Coimbatore | 3.48 |
| 10 | Location_Delhi | 3.10 |
| 11 | Location_Hyderabad | 3.76 |
| 12 | Location_Jaipur | 2.70 |
| 13 | Location_Kochi | 3.48 |
| 14 | Location_Kolkata | 3.10 |
| 15 | Location_Mumbai | 3.91 |
| 16 | Location_Pune | 3.52 |
| 17 | Fuel_Type_Diesel | 28.85 |
| 18 | Fuel_Type_LPG | 1.17 |
| 19 | Fuel_Type_Petrol | 29.89 |
| 20 | Transmission_Manual | 2.29 |
| 21 | Owner_Type_Fourth & Above | 1.02 |
| 22 | Owner_Type_Second | 1.19 |
| 23 | Owner_Type_Third | 1.14 |
| 24 | Brand_Audi | 187.56 |
| 25 | Brand_Bentley | 2.10 |
| 26 | Brand_Bmw | 219.86 |
| 27 | Brand_Chevrolet | 100.45 |
| 28 | Brand_Datsun | 15.23 |
| 29 | Brand_Fiat | 27.25 |
| 30 | Brand_Force | 3.04 |
| 31 | Brand_Ford | 238.45 |
| 32 | Brand_Hindustan | NaN |
| 33 | Brand_Honda | 469.29 |
| 34 | Brand_Hyundai | 763.84 |
| 35 | Brand_Isuzu | 5.08 |
| 36 | Brand_Jaguar | 31.52 |
| 37 | Brand_Jeep | 15.26 |
| 38 | Brand_Lamborghini | NaN |
| 39 | Brand_Land | 52.47 |
| 40 | Brand_Mahindra | 215.02 |
| 41 | Brand_Maruti | 841.10 |
| 42 | Brand_Mercedes-Benz | 254.28 |
| 43 | Brand_Mini | 22.33 |
| 44 | Brand_Mitsubishi | 32.25 |
| 45 | Brand_Nissan | 83.89 |
| 46 | Brand_Opelcorsa | 2.02 |
| 47 | Brand_Porsche | 16.46 |
| 48 | Brand_Renault | 112.33 |
| 49 | Brand_Skoda | 157.59 |
| 50 | Brand_Smart | 2.03 |
| 51 | Brand_Tata | 158.35 |
| 52 | Brand_Toyota | 337.59 |
| 53 | Brand_Volkswagen | 253.25 |
| 54 | Brand_Volvo | 20.31 |
| 55 | Car_category_Mid_Range | 1.66 |
| 56 | Car_category_Luxury_Cars | 1.57 |
| 57 | Car_category_Ultra_luxury | 1.42 |
Engine_num and Power both have VIF scores well above 5; these two variables are clearly correlated with each other, and the Brand dummies show heavy multicollinearity as well (Brand_Hindustan and Brand_Lamborghini even return NaN, a sign of perfect collinearity with other columns).
This makes intuitive sense: engine displacement (cc) and horsepower (bhp) overlap substantially, since a larger engine should in theory produce more power. Some columns will need to be dropped to handle this correctly.
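This overlap can be confirmed with a quick pairwise correlation; in the notebook the equivalent call would be `df[['Engine_num', 'Power']].corr()`, illustrated here on a tiny made-up sample:

```python
import pandas as pd

# a tiny made-up sample standing in for the notebook's Engine_num and Power columns
demo = pd.DataFrame({
    "Engine_num": [998, 1582, 1199, 1248, 1968, 2494],
    "Power": [58.16, 126.20, 88.70, 88.76, 140.80, 100.60],
})
# Pearson correlation between engine displacement (cc) and horsepower (bhp)
corr = demo["Engine_num"].corr(demo["Power"])
print(round(corr, 2))
```

A clearly positive correlation here is what the high joint VIF scores are picking up.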
Since the Brand information is already captured by the economic Car_category bins we created earlier, let's drop the Brand dummies and fit a lighter model to test.
# keep only the columns whose names do not start with "Brand"
cols = [c for c in df.columns if not c.startswith("Brand")]
# create a new dataframe and read the head
df1 = df[cols]
df1.head()
| | Year | Kilometers_Driven | Power | Seats | Engine_num | Mileage_num | price_log | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||||||||||||||
| 0 | 2010 | 72000 | 58.16 | 5 | 998 | 26.60 | 0.56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2015 | 41000 | 126.20 | 5 | 1582 | 19.67 | 2.53 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2011 | 46000 | 88.70 | 5 | 1199 | 18.20 | 1.50 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 2012 | 87000 | 88.76 | 7 | 1248 | 20.77 | 1.79 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2013 | 40670 | 140.80 | 5 | 1968 | 15.20 | 2.88 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
# We will have to create the x and y datasets again
ind_vars = df1.drop(["price_log"], axis=1)
dep_var = df1[["price_log"]]
# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)
# Splitting data into train and test
x_train1, x_test1, y_train1, y_test1 = train_test_split(
ind_vars_num, dep_var, test_size=0.3, random_state=1
)
print("Number of rows in train data =", x_train1.shape[0])
print("Number of rows in test data =", x_test1.shape[0], "\n\n")
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train1 = sm.add_constant(x_train1)
# Add constant to test data
x_test1 = sm.add_constant(x_test1)
# Fit linear model on new dataset and print the summary()
olsmodel1 = build_ols_model(x_train1)
print(olsmodel1.summary())
Number of rows in train data = 5072
Number of rows in test data = 2174
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.808
Model: OLS Adj. R-squared: 0.807
Method: Least Squares F-statistic: 816.8
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 06:55:53 Log-Likelihood: -1852.7
No. Observations: 5072 AIC: 3759.
Df Residuals: 5045 BIC: 3936.
Df Model: 26
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -173.6506 4.151 -41.837 0.000 -181.788 -165.514
Year 0.0868 0.002 41.952 0.000 0.083 0.091
Kilometers_Driven -1.178e-07 1.61e-07 -0.733 0.463 -4.33e-07 1.97e-07
Power 0.0030 0.000 12.433 0.000 0.003 0.003
Seats 0.0057 0.008 0.679 0.497 -0.011 0.022
Engine_num 0.0002 2.34e-05 7.730 0.000 0.000 0.000
Mileage_num -0.0111 0.002 -5.220 0.000 -0.015 -0.007
Location_Bangalore 0.0652 0.033 1.999 0.046 0.001 0.129
Location_Chennai -0.0180 0.031 -0.586 0.558 -0.078 0.042
Location_Coimbatore 0.0348 0.030 1.170 0.242 -0.024 0.093
Location_Delhi -0.0365 0.030 -1.216 0.224 -0.095 0.022
Location_Hyderabad 0.0547 0.029 1.887 0.059 -0.002 0.111
Location_Jaipur -0.1029 0.032 -3.260 0.001 -0.165 -0.041
Location_Kochi -0.0409 0.030 -1.378 0.168 -0.099 0.017
Location_Kolkata -0.1574 0.030 -5.206 0.000 -0.217 -0.098
Location_Mumbai -0.0005 0.029 -0.017 0.986 -0.057 0.056
Location_Pune -0.0604 0.030 -2.044 0.041 -0.118 -0.002
Fuel_Type_Diesel 0.2572 0.052 4.901 0.000 0.154 0.360
Fuel_Type_LPG 0.1854 0.143 1.300 0.194 -0.094 0.465
Fuel_Type_Petrol 0.0802 0.053 1.501 0.133 -0.025 0.185
Transmission_Manual -0.1355 0.015 -8.852 0.000 -0.166 -0.106
Owner_Type_Fourth & Above 0.0585 0.118 0.498 0.619 -0.172 0.289
Owner_Type_Second -0.0587 0.015 -4.038 0.000 -0.087 -0.030
Owner_Type_Third -0.1795 0.036 -4.957 0.000 -0.250 -0.109
Car_category_Mid_Range 0.7581 0.021 35.621 0.000 0.716 0.800
Car_category_Luxury_Cars 1.0504 0.031 34.280 0.000 0.990 1.110
Car_category_Ultra_luxury 1.2135 0.050 24.171 0.000 1.115 1.312
==============================================================================
Omnibus: 254.478 Durbin-Watson: 2.009
Prob(Omnibus): 0.000 Jarque-Bera (JB): 708.716
Skew: -0.242 Prob(JB): 1.27e-154
Kurtosis: 4.766 Cond. No. 5.89e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.89e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
The R-squared and adjusted R-squared values have decreased, indicating that this lighter model captures somewhat less of the information than the previous one.
This is expected: as we reduce overfitting by removing predictors, the R-squared on the training data tends to decrease.
# Check VIF
print(checking_vif(x_train1))
                      feature        VIF
0                       const  714972.56
1                        Year       1.90
2           Kilometers_Driven       1.57
3                       Power       6.53
4                       Seats       1.89
5                  Engine_num       8.10
6                 Mileage_num       3.26
7          Location_Bangalore       2.41
8            Location_Chennai       3.00
9         Location_Coimbatore       3.46
10             Location_Delhi       3.09
11         Location_Hyderabad       3.73
12            Location_Jaipur       2.67
13             Location_Kochi       3.46
14           Location_Kolkata       3.09
15            Location_Mumbai       3.88
16              Location_Pune       3.50
17           Fuel_Type_Diesel      28.45
18              Fuel_Type_LPG       1.16
19           Fuel_Type_Petrol      29.39
20        Transmission_Manual       1.96
21  Owner_Type_Fourth & Above       1.02
22          Owner_Type_Second       1.18
23           Owner_Type_Third       1.12
24     Car_category_Mid_Range       1.45
25   Car_category_Luxury_Cars       1.38
26  Car_category_Ultra_luxury       1.28
A few features still have high VIF scores (Fuel_Type_Petrol, Fuel_Type_Diesel, Engine_num, and Power) and need treatment.
# Checking model performance
model_pref(olsmodel1, x_train1, x_test1) # No Overfitting.
    Data  RMSE   MAE  MAPE
0  Train  0.35  0.26   inf
1   Test  0.33  0.26   inf
Removing multicollinearity: Engine_num and the fuel-type dummies
Our highest VIF offenders are, in order:
Fuel_Type_Petrol 29.39
Fuel_Type_Diesel 28.45
Engine_num 8.10
# list the high-VIF column names
high_vif_columns = [
"Engine_num",
"Fuel_Type_Diesel",
"Fuel_Type_Petrol",
]
# input the last model's test and train df's into the function
treating_multicollinearity(high_vif_columns, x_train1, x_test1)
                col  Adj_rsq_after_dropping_col  Test RMSE
2  Fuel_Type_Petrol                        0.81       0.33
1  Fuel_Type_Diesel                        0.81       0.34
0        Engine_num                        0.80       0.34
The adjusted R-squared stays essentially the same whichever of these features we remove; the model still explains about 81% of the variance in used-car prices (explained variance, not accuracy).
However, test RMSE increases if we remove Fuel_Type_Diesel or Engine_num, meaning the residuals grow; since RMSE measures how accurately the model predicts the response, we want it as low as possible.
"Lower values of RMSE indicate better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important criterion for fit if the main purpose of the model is prediction."Source
Let's drop one just to see.
Drop Fuel_Type_Petrol
# we drop the one with the highest vif values and check the adjusted R-squared
x_train2 = x_train1.drop("Fuel_Type_Petrol", axis=1)
x_train2.head()
| | const | Year | Kilometers_Driven | Power | Seats | Engine_num | Mileage_num | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_LPG | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | ||||||||||||||||||||||||||
| 2080 | 1.00 | 2014 | 71000 | 126.32 | 5 | 1582 | 19.08 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 62 | 1.00 | 2015 | 58000 | 186.00 | 5 | 1796 | 11.74 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6978 | 1.00 | 2008 | 131000 | 105.00 | 5 | 1896 | 15.00 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3502 | 1.00 | 2015 | 54339 | 88.73 | 5 | 1396 | 22.54 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2156 | 1.00 | 2014 | 105044 | 100.60 | 7 | 2494 | 12.99 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
# create VIF scores
vif_series2 = pd.Series(
[variance_inflation_factor(x_train2.values, i) for i in range(x_train2.shape[1])],
index=x_train2.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series2))
VIF Scores:

const                        714843.18
Year                              1.90
Kilometers_Driven                 1.57
Power                             6.52
Seats                             1.89
Engine_num                        7.99
Mileage_num                       3.08
Location_Bangalore                2.41
Location_Chennai                  3.00
Location_Coimbatore               3.46
Location_Delhi                    3.09
Location_Hyderabad                3.73
Location_Jaipur                   2.67
Location_Kochi                    3.46
Location_Kolkata                  3.09
Location_Mumbai                   3.88
Location_Pune                     3.50
Fuel_Type_Diesel                  2.11
Fuel_Type_LPG                     1.01
Transmission_Manual               1.96
Owner_Type_Fourth & Above         1.02
Owner_Type_Second                 1.18
Owner_Type_Third                  1.12
Car_category_Mid_Range            1.45
Car_category_Luxury_Cars          1.38
Car_category_Ultra_luxury         1.28
dtype: float64
Dropping Fuel_Type_Petrol brought the Fuel_Type_Diesel VIF down to 2.11, well under 5. Power and Engine_num, still above 5, remain to be handled.
Create Model 3 and score
# drop the remaining high-VIF columns and check the adjusted R-squared
high_vif_columns = [
"Engine_num",
# "Fuel_Type_Diesel",
"Fuel_Type_Petrol",
"Power",
]
# create a new dataframe and read the head
df2 = df1.drop(high_vif_columns, axis=1)
df2.head()
# We will have to create the x and y datasets again
ind_vars = df2.drop(["price_log"], axis=1)
dep_var = df2[["price_log"]]
# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)
# Splitting data into train and test
x_train3, x_test3, y_train3, y_test3 = train_test_split(
ind_vars_num, dep_var, test_size=0.3, random_state=1
)
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train3 = sm.add_constant(x_train3)
# Add constant to test data
x_test3 = sm.add_constant(x_test3)
# create VIF scores
vif_series3 = pd.Series(
[variance_inflation_factor(x_train3.values, i) for i in range(x_train3.shape[1])],
index=x_train3.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series3))
VIF Scores:

const                        714213.66
Year                              1.89
Kilometers_Driven                 1.56
Seats                             1.58
Mileage_num                       2.01
Location_Bangalore                2.41
Location_Chennai                  3.00
Location_Coimbatore               3.46
Location_Delhi                    3.09
Location_Hyderabad                3.73
Location_Jaipur                   2.67
Location_Kochi                    3.45
Location_Kolkata                  3.08
Location_Mumbai                   3.88
Location_Pune                     3.49
Fuel_Type_Diesel                  1.56
Fuel_Type_LPG                     1.01
Transmission_Manual               1.68
Owner_Type_Fourth & Above         1.01
Owner_Type_Second                 1.18
Owner_Type_Third                  1.12
Car_category_Mid_Range            1.38
Car_category_Luxury_Cars          1.26
Car_category_Ultra_luxury         1.12
dtype: float64
All VIF scores are now below 5, so we have passed the multicollinearity check and can create our new model.
print("Number of rows in train data =", x_train3.shape[0])
print("Number of rows in test data =", x_test3.shape[0], "\n\n")
# Fit linear model on new dataset and print the summary()
olsmodel2 = build_ols_model(x_train3)
print(olsmodel2.summary())
Number of rows in train data = 5072
Number of rows in test data = 2174
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.781
Model: OLS Adj. R-squared: 0.780
Method: Least Squares F-statistic: 782.9
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 06:55:54 Log-Likelihood: -2186.4
No. Observations: 5072 AIC: 4421.
Df Residuals: 5048 BIC: 4577.
Df Model: 23
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -175.8018 4.429 -39.692 0.000 -184.485 -167.119
Year 0.0885 0.002 40.108 0.000 0.084 0.093
Kilometers_Driven 8.531e-09 1.71e-07 0.050 0.960 -3.27e-07 3.44e-07
Seats 0.0001 0.008 0.012 0.990 -0.016 0.016
Mileage_num -0.0410 0.002 -23.040 0.000 -0.044 -0.037
Location_Bangalore 0.0625 0.035 1.795 0.073 -0.006 0.131
Location_Chennai -0.0400 0.033 -1.224 0.221 -0.104 0.024
Location_Coimbatore 0.0157 0.032 0.496 0.620 -0.046 0.078
Location_Delhi -0.0258 0.032 -0.803 0.422 -0.089 0.037
Location_Hyderabad 0.0330 0.031 1.067 0.286 -0.028 0.094
Location_Jaipur -0.1370 0.034 -4.073 0.000 -0.203 -0.071
Location_Kochi -0.0344 0.032 -1.085 0.278 -0.097 0.028
Location_Kolkata -0.1577 0.032 -4.887 0.000 -0.221 -0.094
Location_Mumbai 0.0063 0.031 0.204 0.838 -0.054 0.067
Location_Pune -0.0892 0.032 -2.830 0.005 -0.151 -0.027
Fuel_Type_Diesel 0.3395 0.013 25.876 0.000 0.314 0.365
Fuel_Type_LPG 0.0739 0.142 0.521 0.602 -0.204 0.352
Transmission_Manual -0.2747 0.015 -18.161 0.000 -0.304 -0.245
Owner_Type_Fourth & Above -0.0110 0.125 -0.088 0.930 -0.257 0.235
Owner_Type_Second -0.0656 0.016 -4.226 0.000 -0.096 -0.035
Owner_Type_Third -0.1759 0.039 -4.551 0.000 -0.252 -0.100
Car_category_Mid_Range 0.8769 0.022 39.618 0.000 0.834 0.920
Car_category_Luxury_Cars 1.2727 0.031 40.585 0.000 1.211 1.334
Car_category_Ultra_luxury 1.6833 0.050 33.613 0.000 1.585 1.781
==============================================================================
Omnibus: 115.791 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 202.525
Skew: -0.188 Prob(JB): 1.05e-44
Kurtosis: 3.904 Cond. No. 5.88e+07
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.88e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
# Checking model performance
model_pref(olsmodel2, x_train3, x_test3) # No Overfitting.
    Data  RMSE   MAE  MAPE
0  Train  0.37  0.29   inf
1   Test  0.37  0.29   inf
Adj. R-squared: decreased by about 2 percentage points from the previous model, as expected, but the model still explains a respectable 78% of the variance in price.
RMSE: increased on both train and test, so the fit is looser; train and test scores are now equal, indicating the model is no longer overfitting the training data. RMSE remains greater than MAE because squaring penalises outliers more.
MAE: increased on train and test; the current model predicts used-car prices within a mean error of 0.29 on the log-price scale on test data.
MAPE: remains inf because some actual values are zero.
Next, remove the variables whose p-values are above 0.05 and re-run the model; dropping insignificant predictors simplifies the model without sacrificing much explanatory power.
# Drop the statistically insignificant features (p-value > 0.05)
cols_to_drop = [
    "Kilometers_Driven", "Location_Hyderabad", "Fuel_Type_LPG", "Location_Mumbai",
    "Owner_Type_Fourth & Above", "Location_Kochi", "Location_Bangalore",
    "Location_Chennai", "Location_Coimbatore", "Location_Delhi", "Mileage_num",
]
# drop them from the train data and re-check the adjusted R-squared
x_train4 = x_train3.drop(cols_to_drop, axis=1)
# lock the new data frame
#x_train3 = x_train2.loc[:, ~x_train2.columns.str.startswith(col_to_drop)]
#x_test3 = x_test2.loc[:, ~x_test2.columns.str.startswith(col_to_drop)]
x_train4.head()
| | const | Year | Seats | Location_Jaipur | Location_Kolkata | Location_Pune | Fuel_Type_Diesel | Transmission_Manual | Owner_Type_Second | Owner_Type_Third | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 2080 | 1.00 | 2014 | 5 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 62 | 1.00 | 2015 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 6978 | 1.00 | 2008 | 5 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3502 | 1.00 | 2015 | 5 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 2156 | 1.00 | 2014 | 7 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
# create VIF scores
vif_series3 = pd.Series(
[variance_inflation_factor(x_train4.values, i) for i in range(x_train4.shape[1])],
index=x_train4.columns,
)
print("VIF Scores: \n\n{}\n".format(vif_series3))
VIF Scores:

const                        478325.68
Year                              1.26
Seats                             1.12
Location_Jaipur                   1.04
Location_Kolkata                  1.05
Location_Pune                     1.06
Fuel_Type_Diesel                  1.20
Transmission_Manual               1.45
Owner_Type_Second                 1.15
Owner_Type_Third                  1.10
Car_category_Mid_Range            1.31
Car_category_Luxury_Cars          1.21
Car_category_Ultra_luxury         1.08
dtype: float64
print("Number of rows in train data =", x_train4.shape[0])
# print("Number of rows in train data =", x_test4.shape[0], "\n\n")
# Fit linear model on new dataset and print the summary()
olsmodel4 = build_ols_model(x_train4)
print(olsmodel4.summary())
Number of rows in train data = 5072
OLS Regression Results
==============================================================================
Dep. Variable: price_log R-squared: 0.757
Model: OLS Adj. R-squared: 0.756
Method: Least Squares F-statistic: 1310.
Date: Fri, 20 Aug 2021 Prob (F-statistic): 0.00
Time: 06:55:54 Log-Likelihood: -2455.8
No. Observations: 5072 AIC: 4938.
Df Residuals: 5059 BIC: 5022.
Df Model: 12
Covariance Type: nonrobust
=============================================================================================
coef std err t P>|t| [0.025 0.975]
---------------------------------------------------------------------------------------------
const -139.6884 3.818 -36.584 0.000 -147.174 -132.203
Year 0.0700 0.002 36.948 0.000 0.066 0.074
Seats 0.1005 0.007 13.816 0.000 0.086 0.115
Location_Jaipur -0.1543 0.022 -6.959 0.000 -0.198 -0.111
Location_Kolkata -0.1679 0.020 -8.453 0.000 -0.207 -0.129
Location_Pune -0.1017 0.018 -5.569 0.000 -0.138 -0.066
Fuel_Type_Diesel 0.2352 0.012 19.383 0.000 0.211 0.259
Transmission_Manual -0.4019 0.015 -27.147 0.000 -0.431 -0.373
Owner_Type_Second -0.0538 0.016 -3.334 0.001 -0.085 -0.022
Owner_Type_Third -0.1939 0.040 -4.812 0.000 -0.273 -0.115
Car_category_Mid_Range 0.9782 0.023 42.935 0.000 0.934 1.023
Car_category_Luxury_Cars 1.3994 0.032 43.309 0.000 1.336 1.463
Car_category_Ultra_luxury 1.8896 0.052 36.527 0.000 1.788 1.991
==============================================================================
Omnibus: 44.173 Durbin-Watson: 2.011
Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.903
Skew: -0.027 Prob(JB): 6.62e-16
Kurtosis: 3.573 Cond. No. 1.39e+06
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.39e+06. This might indicate that there are
strong multicollinearity or other numerical problems.
Observations
We have removed the multicollinearity.
The adjusted R-squared is 0.807 for olsmodel1 and 0.756 for olsmodel4, a modest reduction after removing Power, Engine_num, and the other correlated or insignificant features; olsmodel4 is the model we move forward with.
No remaining feature has a p-value greater than 0.05, so we consider the features in x_train4 final and olsmodel4 our final model.
Next we check that the mean of the residuals is approximately zero, which shows the line of best fit is centred correctly.
# get the residual factors and then the mean
residuals = olsmodel4.resid
np.mean(residuals)
-1.294271065190537e-11
Test: Goldfeld-Quandt test for heteroscedasticity
Homoscedasticity: if the variance of the residuals is distributed symmetrically across the regression line, the data are said to be homoscedastic.
Heteroscedasticity: if the variance of the residuals is unequal across the regression line, the data are said to be heteroscedastic; in this case the residuals can form a funnel or another asymmetric shape.
For the Goldfeld-Quandt test, the null and alternate hypotheses are:
H0: the residuals are homoscedastic
Ha: the residuals are heteroscedastic
alpha = 0.05
# Goldfeld-Quandt test on the final model's residuals
# (use x_train4, the design matrix olsmodel4 was fitted on)
name = ["F statistic", "p-value"]
test = sms.het_goldfeldquandt(residuals, x_train4)
lzip(name, test)
[('F statistic', 1.0562479984087707), ('p-value', 0.08516681687499524)]
Since the p-value > 0.05, we cannot reject the null hypothesis: the residuals are homoscedastic. Assumption 3 (no heteroscedasticity) is therefore satisfied by olsmodel4.
Linearity of variables
We check linearity with a residuals vs fitted plot: the residuals should show no pattern against the predicted values. If a pattern appears, transforming variables or adding non-linear terms can fix the violation.
# get the residual factors and then the mean
residual = olsmodel4.resid
fitted = olsmodel4.fittedvalues # predicted values
# create a residual plot
sns.set_style("whitegrid")
sns.residplot(x=fitted, y=residual, color="purple", lowess=True)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.title("Residual vs Fitted plot")
plt.show()
The scatter plot shows the distribution of residuals (errors) versus fitted values (predicted values).
Any pattern in this plot is a sign of non-linearity in the data, meaning the model does not capture some non-linear effects.
We do see a mild negative linear trend mixed with a uniform band; ideally this would be addressed (for example with further transformations), but we will proceed with the current model.
Normality of error terms
The error terms (residuals) should be normally distributed.
If the error terms are not normally distributed, confidence intervals may become too wide or too narrow; once the confidence intervals are unstable, it becomes difficult to trust coefficient estimates based on minimising least squares.
How to check normality: with a QQ plot, residuals that follow a normal distribution fall along a straight line, otherwise they deviate from it. The Shapiro-Wilk test is another check for normality.
If the residuals are not normal, transforming the target or the features can often help.
# create a histogram of residuals
sns.histplot(residual, kde=True)  # distplot is deprecated in newer seaborn versions
plt.title("Normality of residuals")
plt.show()
The QQ plot of residuals can be used to visually check the normality assumption. The normal probability plot of residuals should approximately follow a straight line.
# create a probability plot
stats.probplot(residual, dist="norm", plot=plt)
plt.show()
The residuals have a close-to-normal distribution, so assumption 5 is also satisfied. We should further investigate the values in the tails, where the model makes its largest residual errors.
Now that we have seen that olsmodel4 follows all the linear regression assumptions, let us use that model to draw inferences.
# get Shapiro test results
stats.shapiro(residual)
ShapiroResult(statistic=0.996060311794281, pvalue=2.0631184260988533e-10)
The Shapiro-Wilk test rejects strict normality (p-value < 0.05), which is common with large samples where even small deviations are flagged. Given the near-straight QQ plot, we treat the normality assumption as approximately satisfied and move on to the prediction part.
Let us now make predictions on the test set (X_test) with the model, and compare the actual output values with the predicted values.
# create a new dataframe (df2 already excludes the high-VIF columns,
# so we only need to drop the insignificant ones here)
df3 = df2.drop(labels=cols_to_drop, axis=1)
df3.head()
# We will have to create the x and y datasets again
ind_vars = df3.drop(["price_log"], axis=1)
dep_var = df3[["price_log"]]
# Dummy encoding
ind_vars_num = encode_cat_vars(ind_vars)
# Splitting data into train and test
x_train5, x_test5, y_train5, y_test5 = train_test_split(
ind_vars_num, dep_var, test_size=0.3, random_state=1
)
# Statsmodel api does not add a constant by default. We need to add it explicitly.
x_train5 = sm.add_constant(x_train5)
# Add constant to test data
x_test5 = sm.add_constant(x_test5)
x_test5.shape
(2174, 13)
x_test5.head()
| | const | Year | Seats | Location_Jaipur | Location_Kolkata | Location_Pune | Fuel_Type_Diesel | Transmission_Manual | Owner_Type_Second | Owner_Type_Third | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 5043 | 1.00 | 2014 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 1.00 | 2014 | 5 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2782 | 1.00 | 2014 | 5 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3659 | 1.00 | 2015 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2869 | 1.00 | 2014 | 5 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
y_test5.shape
(2174, 1)
# create linear model on training data
reg = LinearRegression().fit(x_train5, y_train5)
reg.predict(x_test5) # Predict using the linear model. # predictions on the test set
reg.score(
x_test5, y_test5
) # return R2 coefficient of determination of the prediction. (0-1)
0.7601503856141686
# get df columns
x_train5.columns
Index(['const', 'Year', 'Seats', 'Location_Jaipur', 'Location_Kolkata',
'Location_Pune', 'Fuel_Type_Diesel', 'Transmission_Manual',
'Owner_Type_Second', 'Owner_Type_Third', 'Car_category_Mid_Range',
'Car_category_Luxury_Cars', 'Car_category_Ultra_luxury'],
dtype='object')
# Get the range of the y-variable
cols = [np.exp(y_train5), np.exp(y_test5)]
for i in cols:
display(i.max() - i.min())
price_log   96.63
dtype: float64
price_log   119.55
dtype: float64
# Selecting columns from test data that we used to create our final model
X_test_final = x_test5[x_train5.columns]
# look at the head
X_test_final.head()
| | const | Year | Seats | Location_Jaipur | Location_Kolkata | Location_Pune | Fuel_Type_Diesel | Transmission_Manual | Owner_Type_Second | Owner_Type_Third | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||
| 5043 | 1.00 | 2014 | 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 5670 | 1.00 | 2014 | 5 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2782 | 1.00 | 2014 | 5 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
| 3659 | 1.00 | 2015 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2869 | 1.00 | 2014 | 5 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 |
Observations
# Checking final model performance on train (70%) and unseen test (30%) data
print("Test Performance\n")
model_pref(olsmodel4, x_train5, x_test5)
Test Performance
Data RMSE MAE MAPE
0 Train 0.39 0.31 inf
1 Test 0.39 0.30 inf
RMSE: increased slightly from model 3 on both train and test, meaning the fit is a little looser.
MAE: also increased slightly; the current model predicts used-car prices within a mean error of 0.30 on the log-price scale on test data.
MAPE: remains inf because some actual values are zero.
# let us print the model summary
olsmodel4.summary()
| Dep. Variable: | price_log | R-squared: | 0.757 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | 0.756 |
| Method: | Least Squares | F-statistic: | 1310. |
| Date: | Fri, 20 Aug 2021 | Prob (F-statistic): | 0.00 |
| Time: | 07:08:54 | Log-Likelihood: | -2455.8 |
| No. Observations: | 5072 | AIC: | 4938. |
| Df Residuals: | 5059 | BIC: | 5022. |
| Df Model: | 12 | ||
| Covariance Type: | nonrobust |
| coef | std err | t | P>|t| | [0.025 | 0.975] | |
|---|---|---|---|---|---|---|
| const | -139.6884 | 3.818 | -36.584 | 0.000 | -147.174 | -132.203 |
| Year | 0.0700 | 0.002 | 36.948 | 0.000 | 0.066 | 0.074 |
| Seats | 0.1005 | 0.007 | 13.816 | 0.000 | 0.086 | 0.115 |
| Location_Jaipur | -0.1543 | 0.022 | -6.959 | 0.000 | -0.198 | -0.111 |
| Location_Kolkata | -0.1679 | 0.020 | -8.453 | 0.000 | -0.207 | -0.129 |
| Location_Pune | -0.1017 | 0.018 | -5.569 | 0.000 | -0.138 | -0.066 |
| Fuel_Type_Diesel | 0.2352 | 0.012 | 19.383 | 0.000 | 0.211 | 0.259 |
| Transmission_Manual | -0.4019 | 0.015 | -27.147 | 0.000 | -0.431 | -0.373 |
| Owner_Type_Second | -0.0538 | 0.016 | -3.334 | 0.001 | -0.085 | -0.022 |
| Owner_Type_Third | -0.1939 | 0.040 | -4.812 | 0.000 | -0.273 | -0.115 |
| Car_category_Mid_Range | 0.9782 | 0.023 | 42.935 | 0.000 | 0.934 | 1.023 |
| Car_category_Luxury_Cars | 1.3994 | 0.032 | 43.309 | 0.000 | 1.336 | 1.463 |
| Car_category_Ultra_luxury | 1.8896 | 0.052 | 36.527 | 0.000 | 1.788 | 1.991 |
| Omnibus: | 44.173 | Durbin-Watson: | 2.011 |
|---|---|---|---|
| Prob(Omnibus): | 0.000 | Jarque-Bera (JB): | 69.903 |
| Skew: | -0.027 | Prob(JB): | 6.62e-16 |
| Kurtosis: | 3.573 | Cond. No. | 1.39e+06 |
olsmodel4 is our final model; it follows all the assumptions and can be used for interpretation.
The general linear regression formula is: Y = b0 + b1*X1 + b2*X2 + ... + e.
Using the coefficients from the summary above, our fitted model is:
log(Price) = -139.6884 + 0.0700*Year + 0.1005*Seats - 0.1543*Location_Jaipur - 0.1679*Location_Kolkata - 0.1017*Location_Pune + 0.2352*Fuel_Type_Diesel - 0.4019*Transmission_Manual - 0.0538*Owner_Type_Second - 0.1939*Owner_Type_Third + 0.9782*Car_category_Mid_Range + 1.3994*Car_category_Luxury_Cars + 1.8896*Car_category_Ultra_luxury
It is important to note here that correlation is not causation: an increase in Mileage does not itself cause a drop in prices. Rather, cars with high mileage tend not to have high-powered engines and therefore command lower prices.
The model indicates that the most significant predictors of the price of used cars are:
- `Car_category_Ultra_luxury`
- `Car_category_Luxury_Cars`
- `Car_category_Mid_Range`
- `Fuel_Type_Diesel`
- `Year`
The following features have a positive influence on price:
- `Year`
- `Seats`
- `Fuel_Type_Diesel`
The following features have a negative influence on price:
- `Transmission_Manual`
- `Owner_Type_Third`
- `Owner_Type_Second`
The features of greatest impact are:
- `Transmission_Manual` - in the negative direction.
- `Car_category_Ultra_luxury` - in the positive direction.
It is important to note here that the predicted values are log(Price), so the coefficients have to be transformed back to understand their influence on price. When only the dependent variable is log-transformed, exponentiate the coefficient: a one-unit increase in the predictor multiplies the price by exp(coefficient), which for small coefficients is approximately a (coefficient × 100)% change. Example: a coefficient of 0.198 gives exp(0.198) ≈ 1.219, i.e. about a 22% increase in price per one-unit increase in the predictor.
Newer cars sell for higher prices: a one-unit increase in the year of manufacture multiplies the price by exp(0.1145) ≈ 1.12, roughly a 12% increase, everything else held constant.
As the number of seats increases, the price of the car increases: each additional seat multiplies the price by exp(0.0145) ≈ 1.01, about a 1.5% increase.
Kilometers Driven has a negative relationship with the price, which is intuitive: a car that has been driven more will have more wear and tear and hence sell at a lower price, everything else held constant.
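The conversions above can be checked numerically. A small sketch using the coefficient values quoted in the text (`pct_effect` is a hypothetical helper, not part of the original notebook):

```python
import math

def pct_effect(coef):
    """Approximate % change in Price for a one-unit increase in a predictor,
    when the dependent variable is log-transformed."""
    return (math.exp(coef) - 1) * 100

# Year coefficient 0.1145: each extra model year adds roughly 12% to the price.
print(round(pct_effect(0.1145), 1))
# Seats coefficient 0.0145: each extra seat adds roughly 1.5%.
print(round(pct_effect(0.0145), 1))
# Transmission_Manual coefficient -0.4019: manual cars price roughly 33% lower.
print(round(pct_effect(-0.4019), 1))
```

Note that for large coefficients the naive "coefficient × 100%" reading breaks down: exp(-0.4019) is a 33% drop, not 40%.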
The categorical variables are a little harder to interpret, but all the Car_category dummies in the dataset have a positive relationship with Price, and the magnitude of that relationship grows as the category moves up toward the luxury brands. Since the dropped Car_category level for budget-friendly cars serves as the baseline, the increasingly positive coefficients of the other three imply that budget cars price below every other category.
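How the dropped category becomes the baseline can be seen with a toy example. A sketch using pandas (the tiny frame below is invented for illustration; only the category names mirror the model's):

```python
import pandas as pd

# Toy frame with the four car categories used in the model.
cars = pd.DataFrame(
    {"Car_category": ["Budget", "Mid_Range", "Luxury_Cars", "Ultra_luxury"]}
)

# drop_first=True drops the alphabetically first level ("Budget"), so Budget
# rows get all-zero dummies: their effect is absorbed into the intercept, and
# every remaining Car_category coefficient is measured relative to budget cars.
dummies = pd.get_dummies(cars["Car_category"], prefix="Car_category", drop_first=True)
print(list(dummies.columns))
# The Budget row is all zeros across the three remaining dummy columns.
print(int(dummies.iloc[0].sum()))
```

This is why "Budget" never appears in the coefficient table even though it clearly affects price: its effect is the reference point the other coefficients are measured against.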
Work on obtaining a dataset with a significant number of new-car prices to build a complementary model, and develop a strategy for harvesting cleaner, clearer datasets on an ongoing basis.
Flag any electric vehicles as high-cost and the Hindustan brand as low-cost. The other fuel types have similar effects to each other and tend toward lower cost; the remaining makes/brands are higher-cost but roughly interchangeable, as they have similar effects on Price.
Automatic and manual transmissions do not have similar effects: `Transmission_Manual` carries one of the largest negative coefficients in the model, so transmission type should be factored into pricing recommendations rather than disregarded.
When predicting used-car prices, use the price-category data to guide customers toward the prices they can expect to pay.
Some southern markets tend to have higher prices, so planning growth in southern cities could be a good strategy. Markets like Kolkata (coefficient ≈ -0.15 on the log scale) are very risky, and we need to be careful about investments in this area.
I could have tried log transformations to make the independent variables more normal, instead of only the dependent variable.
I could have removed categorical levels or numeric columns with only one record. Like outliers, a single such record can skew the inferences in one direction.
I could have performed a PCA feature analysis to reduce the number of input features, which would possibly have changed the R^2 value and the coefficients.
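The PCA idea can be sketched quickly with scikit-learn on synthetic numbers (the three generated features are stand-ins for correlated numeric columns like Power, Engine_num, and Mileage_num; nothing below comes from the actual dataset):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic, deliberately correlated stand-ins for numeric car features.
power = rng.normal(150, 40, size=500)
engine = 12 * power + rng.normal(0, 100, size=500)      # tracks power closely
mileage = 25 - power / 20 + rng.normal(0, 1, size=500)  # falls as power rises
X = np.column_stack([power, engine, mileage])

# Standardize first (PCA is scale-sensitive), then keep only the components
# needed to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print(f"{X_reduced.shape[1]} of 3 components explain "
      f"{pca.explained_variance_ratio_.sum():.0%} of the variance")
```

Because the inputs are strongly correlated, fewer components than original features suffice, which is exactly the dimensionality reduction that could have been tried before refitting the regression.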
Analyzing predictions where the model was far off the mark:
import math

import numpy as np

# Extract the rows of the original data frame df whose indexes match the
# training data of the best model
original_df = df[df.index.isin(x_train5.index.values)].copy()
# Extract residuals and fitted values from the final model
residuals = olsmodel4.resid
fitted_values = olsmodel4.fittedvalues
# Add columns for the predictions on the log scale and in Lakhs
original_df["Predicted price_log"] = fitted_values
original_df["Predicted Price"] = fitted_values.apply(math.exp)
original_df["residuals"] = residuals
# exp(residual) is the multiplicative error factor (actual / predicted price);
# the column name is kept for consistency with the output below
original_df["Abs_residuals"] = residuals.apply(math.exp)
# Absolute error in Lakhs: compare the actual price, exp(price_log), with the prediction
original_df["Difference in Lakhs"] = np.abs(
    original_df["price_log"].apply(math.exp) - original_df["Predicted Price"]
)
# Look at the predictions where our model made the largest estimation errors (on train data)
original_df.sort_values(by=["Difference in Lakhs"], ascending=False).head(100)
| Year | Kilometers_Driven | Power | Seats | Engine_num | Mileage_num | price_log | Location_Bangalore | Location_Chennai | Location_Coimbatore | Location_Delhi | Location_Hyderabad | Location_Jaipur | Location_Kochi | Location_Kolkata | Location_Mumbai | Location_Pune | Fuel_Type_Diesel | Fuel_Type_LPG | Fuel_Type_Petrol | Transmission_Manual | Owner_Type_Fourth & Above | Owner_Type_Second | Owner_Type_Third | Brand_Audi | Brand_Bentley | Brand_Bmw | Brand_Chevrolet | Brand_Datsun | Brand_Fiat | Brand_Force | Brand_Ford | Brand_Hindustan | Brand_Honda | Brand_Hyundai | Brand_Isuzu | Brand_Jaguar | Brand_Jeep | Brand_Lamborghini | Brand_Land | Brand_Mahindra | Brand_Maruti | Brand_Mercedes-Benz | Brand_Mini | Brand_Mitsubishi | Brand_Nissan | Brand_Opelcorsa | Brand_Porsche | Brand_Renault | Brand_Skoda | Brand_Smart | Brand_Tata | Brand_Toyota | Brand_Volkswagen | Brand_Volvo | Car_category_Mid_Range | Car_category_Luxury_Cars | Car_category_Ultra_luxury | Predicted price_log | Predicted Price | residuals | Abs_residuals | Difference in Lakhs | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S.No. | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| 5038 | 2019 | 24013 | 148.31 | 7 | 1999 | 12.81 | 4.03 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.55 | 95.00 | -0.53 | 0.59 | 90.97 |
| 5067 | 2019 | 17201 | 187.70 | 7 | 2179 | 12.51 | 4.08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.55 | 95.00 | -0.48 | 0.62 | 90.92 |
| 1583 | 2019 | 6773 | 245.00 | 7 | 2967 | 14.75 | 4.29 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.55 | 95.00 | -0.26 | 0.77 | 90.71 |
| 282 | 2018 | 11287 | 241.40 | 7 | 2967 | 12.07 | 4.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.48 | 88.57 | -0.23 | 0.79 | 84.32 |
| 327 | 2017 | 97430 | 245.00 | 7 | 2967 | 14.75 | 4.14 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.41 | 82.58 | -0.28 | 0.76 | 78.44 |
| 3705 | 2019 | 29311 | 170.00 | 5 | 2143 | 17.90 | 3.93 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.35 | 77.71 | -0.42 | 0.66 | 73.77 |
| 3043 | 2019 | 25899 | 147.50 | 5 | 2179 | 12.63 | 4.02 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.35 | 77.71 | -0.33 | 0.72 | 73.69 |
| 1505 | 2019 | 26013 | 255.00 | 5 | 2993 | 12.65 | 4.58 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.35 | 77.71 | 0.22 | 1.25 | 73.13 |
| 4345 | 2016 | 45180 | 241.40 | 7 | 2967 | 12.07 | 3.92 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.34 | 77.00 | -0.43 | 0.65 | 73.08 |
| 3051 | 2016 | 16000 | 254.80 | 7 | 2987 | 11.30 | 4.03 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.34 | 77.00 | -0.32 | 0.73 | 72.97 |
| 918 | 2016 | 76000 | 258.00 | 7 | 2993 | 15.97 | 4.06 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.34 | 77.00 | -0.28 | 0.75 | 72.94 |
| 1336 | 2016 | 20002 | 245.00 | 7 | 2967 | 14.75 | 4.20 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.34 | 77.00 | -0.14 | 0.87 | 72.79 |
| 489 | 2017 | 129000 | 258.00 | 7 | 2987 | 11.00 | 4.03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.31 | 74.60 | -0.29 | 0.75 | 70.57 |
| 5009 | 2017 | 59500 | 245.00 | 7 | 2967 | 14.75 | 4.22 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.31 | 74.60 | -0.09 | 0.91 | 70.38 |
| 2537 | 2016 | 28733 | 241.40 | 7 | 2967 | 12.07 | 3.97 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.29 | 72.96 | -0.32 | 0.73 | 68.99 |
| 866 | 2018 | 56652 | 258.00 | 5 | 2987 | 13.00 | 3.92 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.37 | 0.69 | 68.53 |
| 4003 | 2018 | 3000 | 187.70 | 5 | 2179 | 16.36 | 3.96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.32 | 0.72 | 68.49 |
| 191 | 2018 | 36091 | 187.70 | 5 | 2179 | 12.70 | 4.02 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.26 | 0.77 | 68.43 |
| 2509 | 2018 | 11000 | 194.00 | 5 | 1950 | 16.10 | 4.03 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.25 | 0.78 | 68.42 |
| 2767 | 2018 | 30706 | 254.79 | 5 | 2987 | 11.57 | 4.08 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.20 | 0.82 | 68.37 |
| 3268 | 2018 | 29277 | 254.79 | 5 | 2987 | 11.57 | 4.09 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.19 | 0.82 | 68.36 |
| 2541 | 2018 | 21603 | 265.00 | 5 | 2987 | 13.00 | 4.25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.28 | 72.45 | -0.03 | 0.97 | 68.20 |
| 4938 | 2015 | 56238 | 335.20 | 7 | 4134 | 11.33 | 3.95 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.27 | 71.79 | -0.33 | 0.72 | 67.84 |
| 4166 | 2015 | 45000 | 254.80 | 7 | 2987 | 11.30 | 4.03 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.27 | 71.79 | -0.25 | 0.78 | 67.76 |
| 2593 | 2015 | 21000 | 241.40 | 7 | 2967 | 12.07 | 3.96 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.22 | 68.03 | -0.26 | 0.77 | 64.07 |
| 355 | 2017 | 21000 | 177.00 | 5 | 1999 | 19.33 | 3.95 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.26 | 0.77 | 63.60 |
| 404 | 2017 | 34990 | 187.70 | 5 | 2179 | 12.70 | 3.99 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.23 | 0.80 | 63.56 |
| 2111 | 2017 | 25601 | 201.15 | 5 | 2143 | 17.90 | 4.04 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.17 | 0.84 | 63.51 |
| 310 | 2017 | 29000 | 201.15 | 5 | 2143 | 17.90 | 4.04 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.17 | 0.84 | 63.51 |
| 264 | 2017 | 29819 | 254.79 | 5 | 2987 | 11.57 | 4.12 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.10 | 0.91 | 63.43 |
| 1830 | 2017 | 33552 | 254.79 | 5 | 2987 | 11.57 | 4.17 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | -0.04 | 0.96 | 63.38 |
| 1194 | 2017 | 15000 | 265.00 | 5 | 2987 | 13.00 | 4.24 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | 0.03 | 1.03 | 63.31 |
| 4614 | 2017 | 15000 | 258.00 | 5 | 2993 | 15.97 | 4.25 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.21 | 67.55 | 0.04 | 1.04 | 63.30 |
| 1093 | 2014 | 50000 | 335.20 | 7 | 4134 | 11.33 | 3.93 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.20 | 66.93 | -0.27 | 0.76 | 63.00 |
| 4360 | 2018 | 20004 | 313.00 | 4 | 2993 | 17.54 | 4.23 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.18 | 65.52 | 0.05 | 1.05 | 61.29 |
| 5740 | 2017 | 21000 | 258.00 | 5 | 2993 | 15.97 | 4.01 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.16 | 64.01 | -0.15 | 0.86 | 60.00 |
| 1496 | 2016 | 17652 | 270.88 | 5 | 2993 | 12.90 | 4.07 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.14 | 62.98 | -0.07 | 0.93 | 58.91 |
| 2547 | 2016 | 25000 | 308.43 | 5 | 2993 | 15.87 | 4.09 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.14 | 62.98 | -0.05 | 0.95 | 58.88 |
| 5341 | 2016 | 65003 | 382.00 | 5 | 4134 | 13.60 | 4.22 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.14 | 62.98 | 0.07 | 1.08 | 58.76 |
| 5582 | 2019 | 28069 | 174.50 | 7 | 2755 | 12.90 | 3.43 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 4.06 | 58.19 | -0.63 | 0.53 | 54.76 |
| 5191 | 2019 | 9000 | 158.00 | 7 | 2198 | 12.62 | 3.49 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 4.06 | 58.19 | -0.57 | 0.57 | 54.70 |
| 2855 | 2019 | 21719 | 174.50 | 7 | 2755 | 12.90 | 3.50 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 4.06 | 58.19 | -0.56 | 0.57 | 54.69 |
| 3196 | 2019 | 9000 | 197.00 | 7 | 3198 | 10.91 | 3.57 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 4.06 | 58.19 | -0.49 | 0.61 | 54.62 |
| 3752 | 2015 | 38467 | 255.00 | 5 | 2993 | 12.65 | 4.26 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.07 | 58.72 | 0.19 | 1.20 | 54.46 |
| 2907 | 2015 | 26000 | 301.73 | 5 | 2993 | 14.47 | 4.32 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.07 | 58.72 | 0.24 | 1.28 | 54.40 |
| 5535 | 2015 | 97003 | 308.43 | 5 | 2993 | 15.87 | 4.44 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.07 | 58.72 | 0.37 | 1.45 | 54.28 |
| 1352 | 2012 | 66538 | 308.00 | 7 | 4367 | 10.20 | 4.25 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.06 | 58.18 | 0.18 | 1.20 | 53.93 |
| 1113 | 2018 | 20446 | 191.34 | 5 | 1950 | 12.60 | 3.94 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.05 | 57.27 | -0.11 | 0.90 | 53.33 |
| 418 | 2018 | 22397 | 367.00 | 5 | 2996 | 11.50 | 4.26 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.05 | 57.27 | 0.21 | 1.24 | 53.00 |
| 1974 | 2018 | 28060 | 320.00 | 5 | 2979 | 12.05 | 4.54 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.05 | 57.27 | 0.49 | 1.64 | 52.73 |
| 2669 | 2018 | 51530 | 197.20 | 7 | 3198 | 10.91 | 3.41 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.99 | 54.25 | -0.58 | 0.56 | 50.84 |
| 4387 | 2014 | 35000 | 258.00 | 5 | 2993 | 16.46 | 3.95 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 4.00 | 54.75 | -0.05 | 0.95 | 50.80 |
| 2145 | 2018 | 14000 | 158.00 | 7 | 2198 | 12.62 | 3.47 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.99 | 54.25 | -0.53 | 0.59 | 50.79 |
| 1430 | 2018 | 30552 | 197.20 | 7 | 3198 | 10.91 | 3.48 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.99 | 54.25 | -0.51 | 0.60 | 50.77 |
| 4906 | 2018 | 15000 | 197.00 | 7 | 3198 | 10.91 | 3.50 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.99 | 54.25 | -0.50 | 0.61 | 50.76 |
| 5278 | 2018 | 12000 | 197.00 | 7 | 3198 | 10.91 | 3.51 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.99 | 54.25 | -0.48 | 0.62 | 50.74 |
| 655 | 2017 | 10501 | 237.40 | 5 | 1999 | 9.43 | 4.32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.98 | 53.39 | 0.34 | 1.40 | 49.07 |
| 4185 | 2015 | 54996 | 300.00 | 4 | 2967 | 17.85 | 4.11 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.97 | 53.11 | 0.14 | 1.15 | 49.00 |
| 1984 | 2017 | 17465 | 320.00 | 5 | 2979 | 12.05 | 4.53 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.98 | 53.39 | 0.55 | 1.74 | 48.86 |
| 1968 | 2018 | 37713 | 168.50 | 7 | 2982 | 12.55 | 3.45 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3.94 | 51.41 | -0.49 | 0.62 | 47.96 |
| 4451 | 2018 | 21290 | 395.00 | 4 | 4951 | 13.00 | 4.04 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.95 | 51.79 | 0.09 | 1.10 | 47.75 |
| 5545 | 2014 | 47000 | 255.00 | 5 | 2993 | 12.65 | 4.17 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.95 | 51.88 | 0.22 | 1.25 | 47.71 |
| 5441 | 2017 | 54650 | 197.00 | 7 | 3198 | 10.91 | 3.41 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.51 | 0.60 | 47.17 |
| 1724 | 2017 | 38000 | 197.00 | 7 | 3198 | 10.91 | 3.42 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.51 | 0.60 | 47.17 |
| 5323 | 2017 | 17000 | 197.00 | 7 | 3198 | 10.91 | 3.43 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.49 | 0.61 | 47.15 |
| 703 | 2017 | 50484 | 197.20 | 7 | 3198 | 10.91 | 3.44 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.49 | 0.62 | 47.15 |
| 1568 | 2017 | 40000 | 148.31 | 7 | 1999 | 12.81 | 3.73 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.20 | 0.82 | 46.86 |
| 2178 | 2017 | 35000 | 255.00 | 7 | 2993 | 18.00 | 3.73 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.92 | 50.58 | -0.20 | 0.82 | 46.86 |
| 2706 | 2014 | 60033 | 250.00 | 4 | 2967 | 18.18 | 3.94 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.90 | 49.51 | 0.04 | 1.04 | 45.57 |
| 5158 | 2019 | 20483 | 136.00 | 5 | 2143 | 17.90 | 3.41 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.45 | 0.64 | 44.19 |
| 5416 | 2013 | 75995 | 335.30 | 5 | 4367 | 11.49 | 4.19 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.88 | 48.37 | 0.31 | 1.36 | 44.18 |
| 3567 | 2019 | 13414 | 136.00 | 5 | 2143 | 17.90 | 3.43 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.44 | 0.65 | 44.17 |
| 3161 | 2019 | 14857 | 113.98 | 5 | 1496 | 20.70 | 3.45 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.41 | 0.66 | 44.15 |
| 4591 | 2019 | 38163 | 174.33 | 5 | 1968 | 17.11 | 3.49 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.37 | 0.69 | 44.10 |
| 5875 | 2019 | 4000 | 194.00 | 5 | 1950 | 18.20 | 3.56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.31 | 0.74 | 44.04 |
| 5673 | 2019 | 15976 | 136.00 | 5 | 2143 | 17.90 | 3.60 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.26 | 0.77 | 43.99 |
| 4423 | 2019 | 17320 | 170.00 | 5 | 2148 | 15.80 | 3.64 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.86 | 47.60 | -0.23 | 0.80 | 43.96 |
| 2528 | 2016 | 59000 | 255.00 | 7 | 2993 | 18.00 | 3.60 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.85 | 47.16 | -0.25 | 0.78 | 43.56 |
| 1397 | 2016 | 35659 | 258.00 | 7 | 2993 | 15.97 | 3.78 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.85 | 47.16 | -0.07 | 0.93 | 43.38 |
| 3540 | 2016 | 62000 | 187.70 | 7 | 2179 | 12.51 | 3.80 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.85 | 47.16 | -0.06 | 0.94 | 43.37 |
| 459 | 2016 | 51002 | 335.20 | 7 | 4134 | 11.33 | 3.89 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.85 | 47.16 | 0.04 | 1.04 | 43.27 |
| 5603 | 2013 | 36400 | 394.30 | 4 | 4806 | 7.50 | 4.28 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.83 | 46.16 | 0.44 | 1.56 | 41.89 |
| 3242 | 2012 | 63000 | 335.30 | 5 | 4367 | 11.49 | 4.17 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.81 | 45.10 | 0.37 | 1.44 | 40.92 |
| 4233 | 2018 | 22060 | 147.51 | 5 | 1968 | 18.51 | 3.49 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.30 | 0.74 | 40.89 |
| 5302 | 2016 | 31075 | 241.40 | 7 | 2967 | 12.07 | 3.91 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.80 | 44.69 | 0.11 | 1.11 | 40.78 |
| 1689 | 2018 | 32586 | 190.00 | 5 | 1995 | 22.48 | 3.62 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.17 | 0.84 | 40.76 |
| 4795 | 2018 | 21936 | 190.00 | 5 | 1995 | 21.76 | 3.64 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.16 | 0.85 | 40.74 |
| 3978 | 2018 | 27467 | 167.62 | 5 | 2143 | 19.27 | 3.66 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.13 | 0.88 | 40.71 |
| 2429 | 2018 | 23152 | 167.62 | 5 | 2143 | 19.27 | 3.67 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.12 | 0.88 | 40.71 |
| 224 | 2018 | 29161 | 167.62 | 5 | 2143 | 19.27 | 3.67 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.12 | 0.89 | 40.70 |
| 2121 | 2018 | 24465 | 167.62 | 5 | 2143 | 19.27 | 3.68 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.12 | 0.89 | 40.70 |
| 772 | 2018 | 36797 | 167.62 | 5 | 2143 | 19.27 | 3.68 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.12 | 0.89 | 40.70 |
| 962 | 2018 | 8682 | 194.00 | 5 | 1950 | 18.20 | 3.68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.12 | 0.89 | 40.70 |
| 92 | 2018 | 48367 | 174.33 | 5 | 1968 | 17.68 | 3.68 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | -0.11 | 0.89 | 40.70 |
| 2659 | 2015 | 25100 | 300.00 | 4 | 2967 | 17.85 | 4.32 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 3.80 | 44.90 | 0.51 | 1.67 | 40.58 |
| 5477 | 2018 | 52827 | 187.70 | 5 | 2179 | 12.70 | 3.80 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | 0.01 | 1.01 | 40.58 |
| 2924 | 2018 | 41661 | 187.70 | 5 | 2179 | 12.70 | 3.80 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | 0.01 | 1.01 | 40.57 |
| 5927 | 2018 | 29091 | 241.40 | 5 | 2967 | 13.22 | 3.82 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | 0.03 | 1.03 | 40.56 |
| 603 | 2018 | 10003 | 170.00 | 5 | 2143 | 17.90 | 3.88 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.79 | 44.38 | 0.09 | 1.09 | 40.50 |
| 5761 | 2015 | 55662 | 241.40 | 7 | 2967 | 12.07 | 3.76 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 3.78 | 43.97 | -0.03 | 0.97 | 40.21 |
An Ultra Luxury used car whose model sells at 56.26 Lakhs was predicted to sell at 95 Lakhs. It is not apparent from the numerical predictors why our model predicted such a high value here. It could be because all other Ultra Luxury vehicles in our data seem to have sold at higher prices.
There are no instances of false predictions for Mid_Range vehicles.
There are no instances where the model predicts less than the actual selling price. This could be a cause for concern: a model that consistently predicts above the achievable selling price is bad for business in the used car market, where a lower price point is the main attraction.
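As a minimal sketch of how a "Difference in Lakhs" column like the one plotted below could be derived, assuming (as in our preprocessing) that prices were log-transformed before modeling, the predictions must be exponentiated back to Lakhs before subtracting. The column names and toy values here are illustrative, not the actual pipeline:

```python
import numpy as np
import pandas as pd

# Toy stand-in: actual and predicted prices in Lakhs, stored on the log scale
# (mirroring the price_log target used in the model; values are illustrative)
df = pd.DataFrame({
    "price_log": np.log([5.5, 56.26]),       # actual selling prices (log Lakhs)
    "pred_price_log": np.log([6.0, 95.0]),   # model predictions (log Lakhs)
})

# Back-transform to Lakhs, then take predicted minus actual:
# positive values mean the model over-predicted the selling price
df["Difference in Lakhs"] = np.exp(df["pred_price_log"]) - np.exp(df["price_log"])
print(df["Difference in Lakhs"].round(2).tolist())
```

The second toy row reproduces the 56.26 vs. 95 Lakhs case discussed above, giving an over-prediction of about 38.74 Lakhs.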
Let's take a look at the model's missed predictions for the Ultra Luxury used vehicles.
# scatter plot of predictions: prediction error vs. log price,
# colored by the Ultra Luxury one-hot flag
sns.scatterplot(
    x=original_df["Difference in Lakhs"],
    y=original_df["price_log"],
    hue=original_df["Car_category_Ultra_luxury"],
)
<AxesSubplot:xlabel='Difference in Lakhs', ylabel='price_log'>
Observations
There is a clear grouping of the Mid-Range, Luxury, and Ultra Luxury vehicle categories, demarcated by price difference.
Most outliers are Ultra Luxury models. Our model predicts that the resale value of Ultra Luxury cars is higher than that of Mid-Range and Luxury cars, which is probably the cause of these outliers.
Our model seems to be very good at predicting prices for Mid-Range vehicles, and likely for Budget vehicles as well.
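To make the outlier claim concrete, the Ultra Luxury rows can be isolated and ranked by how far the model overshot. This is a hedged sketch: it assumes `original_df` carries the "Difference in Lakhs" and "Car_category_Ultra_luxury" columns used in the plot above, and a small toy frame stands in for the real data here:

```python
import pandas as pd

# Toy stand-in for original_df with the two columns the plot above relies on
# (values are illustrative, not taken from the real dataset)
original_df = pd.DataFrame({
    "Difference in Lakhs": [0.4, 38.7, 12.1, -0.2],
    "Car_category_Ultra_luxury": [0, 1, 1, 0],
})

# Keep only Ultra Luxury rows, then sort by the size of the over-prediction
ultra = original_df[original_df["Car_category_Ultra_luxury"] == 1]
worst = ultra.sort_values("Difference in Lakhs", ascending=False)
print(worst["Difference in Lakhs"].tolist())
```

On the real data, inspecting the top of this sorted frame would surface the specific Ultra Luxury listings, such as the 56.26-Lakh car predicted at 95 Lakhs, that drive the pattern in the scatter plot.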
----------------- END -----------------------------------